PDF::Tags zef:dwarring last updated on 2022-07-25

PDF-Tags-raku

A small DOM-like API for the creation of tagged PDF files for accessibility purposes.

This module enables PDF tagged content manipulation, construction, XPath queries and basic XML serialization.

See also PDF::Tags::Reader, which is designed to read content from existing tagged PDF files.

Synopsis

use PDF::Tags;
use PDF::Tags::Elem;

# PDF::API6
use PDF::API6;
use PDF::Annot;
use PDF::Destination :Fit;
use PDF::XObject::Image;
use PDF::XObject::Form;

my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
# create the document root
my PDF::Tags::Elem $doc = $tags.Document;

my $page = $pdf.add-page;
my $header-font = $page.core-font: :family<Helvetica>, :weight<bold>;
my $body-font = $page.core-font: :family<Helvetica>;

$pdf.add-page; # blank second page, as a target

$page.graphics: -> $gfx {

    $doc.Header1: $gfx, {
        .say('Marked Level 1 Header',
             :font($header-font),
             :font-size(15),
             :position[50, 120]);
    };

    $doc.Paragraph: $gfx, {
        .say('Marked paragraph text', :position[50, 100], :font($body-font), :font-size(12));
    };

    # add a marked image
    my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";
    $doc.Figure: $gfx, $img, :Alt('Incandescent apparatus');

    # XObject Form with marked content
    my PDF::XObject::Form $form = $page.xobject-form: :BBox[0, 0, 200, 50];
    my $form-frag = $doc.fragment;
    $form.text: {
        my $font-size = 12;
        .text-position = [10, 38];

        $form-frag.Header2: $_, {
            .say: "Tagged XObject header", :font($header-font), :$font-size;
        };

        $form-frag.Paragraph: $_, {
            .say: "Some sample tagged text", :font($body-font), :$font-size;
        };
    }

    # - render the form contained in $form-frag
    # - copy the fragment into the structure tree
    $doc.do: $gfx, $form-frag, :position[150, 70];
}

$pdf.save-as: "/tmp/synopsis.pdf"

Description

A tagged PDF contains additional markup that describes the document's logical structure.

PDF tagging may assist PDF readers and other automated tools in reading PDF documents and locating content such as text and images.

This module provides a DOM-like interface for creating and traversing PDF structure and content via tags. It also provides an XPath-like search capability. It is designed for use in conjunction with PDF::Class or PDF::API6.
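As a rough illustration of traversal and searching, the sketch below builds three elements and then queries them. The `kids`, `name` and `find` method names and the path syntax are assumptions here (they are not shown elsewhere in this document), so check them against the class documentation before relying on them.

```raku
use PDF::Class;
use PDF::Tags;
use PDF::Tags::Elem;

my PDF::Class $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $doc = $tags.Document;

$pdf.add-page.graphics: -> $gfx {
    $doc.Header1:   $gfx, { .say('Title',       :position[50, 120]) };
    $doc.Paragraph: $gfx, { .say('First para',  :position[50, 100]) };
    $doc.Paragraph: $gfx, { .say('Second para', :position[50, 80]) };
}

# DOM-like traversal: visit the direct children of the Document element
# (assumes `kids` and `name` methods on elements)
say .name for $doc.kids;

# XPath-like search from the root: all paragraphs directly under Document
# (assumes a `find` method that accepts this path syntax)
my @paras = $tags.find('Document/P');
say "Paragraphs found: { @paras.elems }";
```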

Standard Tags

Elements may be constructed using their Tag name or Mnemonic, as listed below. For example:

$root.P: $gfx, { .say('Marked paragraph text') };

Can also be written as:

$root.Paragraph: $gfx, { .say('Marked paragraph text') };

Or as:

$root.add-kid(:name<P>).mark: $gfx, { .say('Marked paragraph text') };

"Grouping" elements:

Tag | Mnemonic | Description
Document | | Whole document; must be used if there are multiple parts or articles
Part | | Large division of a document; may group smaller units of content together, such as Division, Article, or Section elements
Art | Article | Self-contained body of text considered to be a single narrative
Sect | Section | General container element type that is usually a component of a Part or Article element
Div | Division | Generic block element or group of elements
BlockQuote | | A large portion of text referencing content from another source
Caption | | Description of a Figure or Table
TOC | TableOfContents | May be nested, and may be used for lists of figures, tables, etc.
TOCI | TableOfContentsItem | Table of contents (leaf) item
Index | | An index of keywords and topics, usually at the end of the document (text with accompanying Reference content)
NonStruct | NonStructural | Non-structural grouping element (element itself not intended to be exported to other formats like HTML, but 'transparent' to its content, which is processed normally)
Private | | Content only meaningful to the creator (element and its content not intended to be exported to other formats like HTML)

"Block" elements:

Tag | Mnemonic | Description
H | Header | Nested section heading (not recommended)
H1 - H6 | Header1 - Header6 | The title or heading of a section within the text content
P | Paragraph | A distinct section of a piece of writing, usually dealing with a single theme
L | List | A group of similar items that are related to each other. Should include an optional Caption, and list items
LI | ListItem | A single list element. Should contain Lbl and/or LBody
Lbl | Label | Bullet, number, or "dictionary headword"
LBody | ListBody | Description of the item; may have nested lists or other blocks
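The nesting described above (L containing LI, which contains Lbl and LBody) can be sketched as follows. This is a minimal example rather than a definitive recipe: it uses only the `add-kid(:name<...>)` and `mark` calls shown earlier, and the item text, coordinates and output path are placeholders.

```raku
use PDF::Class;
use PDF::Tags;
use PDF::Tags::Elem;

my PDF::Class $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $doc = $tags.Document;

$pdf.add-page.graphics: -> $gfx {
    # The L (List) element is a pure container; it is never marked itself
    my $list = $doc.add-kid: :name<L>;
    my $y = 120;

    for 'Tagged headings', 'Tagged paragraphs' -> $text {
        # Each LI (ListItem) groups one label with one body
        my $item = $list.add-kid: :name<LI>;
        $item.add-kid(:name<Lbl>).mark:   $gfx, { .say('*',   :position[50, $y]) };
        $item.add-kid(:name<LBody>).mark: $gfx, { .say($text, :position[65, $y]) };
        $y -= 20;
    }
}

$pdf.save-as: "/tmp/list.pdf";
```

Only the leaf elements (Lbl and LBody) carry marked content; L and LI exist purely to record the grouping.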

"Table" elements:

Tag | Mnemonic | Description
Table | | Content arranged into rows and columns; should either contain TR, or THead, TBody and/or TFoot
TR | TableRow | A single row of cell elements within a table
TH | TableHeader | Description of column contents
TD | TableData | A cell element
THead | TableHead | A row of table headers
TBody | TableBody | Table body; may have more than one per table
TFoot | TableFoot | Table footer row group
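A simple table without THead, TBody or TFoot groups might be tagged along these lines. This is a sketch under assumptions: the `@rows` data, the cell spacing and the output path are made up for illustration, and only the `add-kid` and `mark` calls shown earlier are used.

```raku
use PDF::Class;
use PDF::Tags;
use PDF::Tags::Elem;

my PDF::Class $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $doc = $tags.Document;

# Hypothetical sample data; the first row holds the column headers
my @rows = ['Tag', 'Mnemonic'], ['P', 'Paragraph'], ['L', 'List'];

$pdf.add-page.graphics: -> $gfx {
    my $table = $doc.add-kid: :name<Table>;
    my $y = 120;

    for @rows.kv -> $i, @cells {
        my $row = $table.add-kid: :name<TR>;
        # Cells in the first row are tagged TH, all other cells TD
        my $cell-tag = $i == 0 ?? 'TH' !! 'TD';
        my $x = 50;

        for @cells -> $text {
            $row.add-kid(:name($cell-tag)).mark: $gfx, {
                .say($text, :position[$x, $y]);
            };
            $x += 100;
        }
        $y -= 20;
    }
}

$pdf.save-as: "/tmp/table.pdf";
```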

"Inline" elements:

Tag | Mnemonic | Description
Span | | Generic inline content.
Quote | | Inline text referencing content from another source
Note | | End-note or footnote; may have a Lbl (see "block" elements)
Reference | | Content in a document that refers to other content (e.g. page number in an index)
BibEntry | BibliographyEntry | Text referring the user to source of cited text. May have a Lbl (see "block" elements)
Code | | Computer code
Link | | hyperlink; should contain a link annotation
Annot | Annotation | annotation (other than a link)
Ruby | | Chinese/Japanese pronunciation/explanation
RB | RubyBaseText | Ruby base text
RT | RubyText | Ruby annotation text
RP | RubyPunctuation |
Warichu | | Japanese/Chinese longer description
WT | WarichuText |
WP | WarichuPunctuation |

"Illustration" elements (should have Alt and/or ActualText set):

Tag | Mnemonic | Description
Figure | | An image or graphic that is referenced by the text
Formula | | A scientific or mathematical formula element
Form | | An editable PDF field used to complete a form

Non-structure tags:

Tag | Mnemonic | Description
Artifact | | Used to mark all content not part of the logical structure
ReversedChars | | Every string of text has characters in reverse order for technical reasons (due to how fonts work for right-to-left languages); strings may have spaces at the beginning or end to separate words, but may not have spaces in the middle

Classes in this Distribution

Advanced Topics

Form and Image XObjects

In the simple case, both PDF::XObject::Form and PDF::XObject::Image objects are inserted and tagged externally as atomic graphical elements, typically as Figure or Form:

my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";

my $figure = $doc.Figure: $gfx, $img, :position[50, 70], :Alt("A light-bulb");

A PDF::XObject::Form may also include marked content, which is copied into the structure tree each time the form is inserted. The technique is demonstrated below:

use PDF::Tags;
use PDF::Tags::Elem;
use PDF::Class;
use PDF::XObject::Form;
my PDF::Class $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $doc = $tags.Document;

my PDF::Page $page = $pdf.add-page;
$page.graphics: -> $gfx {
    $doc.Header1: $gfx, {
        .say('Header text');
    }

    my PDF::XObject::Form $form = $page.xobject-form: :BBox[0, 0, 200, 50];
    my PDF::Tags::Elem $form-frag = $doc.fragment;

    $form.text: {
        my $font-size = 12;
        .text-position = [10, 38];
        $form-frag.Header2: $_, {
            .say: "Tagged XObject header";
        };
        my $p = $form-frag.Paragraph: $_, {
            .say: "Some sample tagged text";
        };
    }

    # multiple rendering of the form, and insertion of its structure tree
    $doc.do($gfx, $form-frag, :position[150, 70]);
    $doc.do($gfx, $form-frag, :position[150, 20]);
}

To insert an XObject Form that has marked content:

  1. Create a new fragment element.
  2. Create the Form XObject, marking its content against the fragment.
  3. Use the do method to both render the form and insert a copy of the fragment into the structure tree.

Links

Links are usually contained in a block element, such as a Paragraph. If the link is internal, it should further be enclosed in a Reference element.

Furthermore, if the link is in flowing text, such as a paragraph, the mark method may be needed to mark preceding text, the link, and following text.

Please see examples/link.raku, which demonstrates adding a tagged internal link to a PDF.

Graphics Content Tags

As a rule, not all content has to form part of the structure tree, but all content should be tagged to meet accessibility guidelines.

This sometimes requires tagging of incidental graphics. PDF::Content has a tag() method for this. The content is tagged, but does not appear in the structure tree.

Some of the commonly used content tags are:

Artifact

Artifact content forms part of the visual display, but does not belong in the Structure Tree and is tagged using the PDF::Content tag method.

For example:

$gfx.tag: Artifact, {
    .say("Page $page-num", :$font, :position[ 250, 20 ]);
}

Clipped

A clipped region encompasses additional graphics that are being used as part of a clipping operation. The clipped area may include graphics that are part of the structure tree. For example:

use PDF::Class;
use PDF::Tags;
use PDF::Tags::Elem;
use PDF::Content::Tag :ContentTags, :ParagraphTags;

my PDF::Class $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
# create the document root
my PDF::Tags::Elem $doc = $tags.Document;

$pdf.add-page.graphics: {
    .tag: Clipped, {
        .Rectangle: 100, 100, 125, 20;
        .Clip;
        .EndPath;
        $doc.Paragraph: $_, {
            .say: 'Clip me', :position[98, 98];
        }
    }
}

The above example sets up a clipping sequence. The clipped text is inserted as a paragraph in the structure tree.

Span

This tag may be used in the structure tree, or at the content level to define attributes of a graphics sequence. Its usage is similar to the XHTML span tag.

$gfx.tag: Span, :Lang<es-MX>, {
    .say('Hasta la vista', :position[50, 80]);
}

It can be used almost anywhere in the structure tree, or at the content level, as above.

Verification

The pdf-tag-dump.raku script from the PDF::Tags::Reader module can be used to view the logical content of PDF files as XML, for example:

$ pdf-tag-dump.raku /tmp/synopsis.pdf

Produces

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Document SYSTEM "http://pdf-raku.github.io/dtd/tagged-pdf.dtd">
<?xml-stylesheet type="text/css" href="https://pdf-raku.github.io/css/tagged-pdf.css"?>
<Document>
  <H1>
    Marked Level 1 Header
  </H1>
  <P>
    Marked paragraph text
  </P>
  <Figure BBox="0 0 19 19">
  </Figure>
  <Link href="#sample-annot"></Link>
  <Form BBox="150 70 350 120">
    <H2>
      Tagged XObject header
    </H2>
    <P>
      Some sample tagged text
    </P>
  </Form>
</Document>

The XML output from pdf-tag-dump.raku includes an external DTD for basic validation purposes.

For example, it can be piped to xmllint, from the libxml2 package, to check the structure of the tags:

$ pdf-tag-dump.raku my.pdf | xmllint --noout --valid -

See Also

Further Work

The PDF accessibility standard ISO 14289-1 cannot be distributed and needs to be purchased from ISO.