This glossary was developed by the PDF Accessibility Liaison Working Group. This page remains under active development; more terms will be added as Techniques are released and based on feedback received. Questions or comments should be sent to firstname.lastname@example.org.
ActualText replaces content which is intended to be consumed as text but for which the correct text is not available.
ActualText is used, for example:
WARNING: As a consequence of this distinction, for files using PDF 1.7 or earlier, if replacement of the tag’s semantic meaning by the ActualText is not intended, the best practice is to place the ActualText on a Span tag or on a Span marked content sequence. This ensures that necessary tags (e.g., TD or Lbl) are available to AT. This is different from HTML, in which the ActualText concept does not exist.
In the PDF 2.0 context, ActualText is simply additional information and does not impact the semantics of the tag on which it is set. However, assistive technology can nonetheless choose to ignore the tag and simply represent the ActualText in its place.
Annotations are a commonly used feature of PDF that provide users with the ability to add material to PDF pages. Annotations exist separately from page content but, for accessibility purposes, are considered content.
The PDF specification defines different types of annotations covering many uses, for example, text markup, notes, links, watermarks, forms, 3D content, redaction markup and rich media.
An artifact is a decoration, pagination or layout-related element in a PDF that is not intended by the author to be relevant for understanding the content, and is thus typically ignored by AT.
Content is marked as an artifact to enable software to distinguish it from real content. Examples of artifacts include, but are not limited to:
Assistive technology (AT) refers to any means, tool or technical solution which allows users with disabilities or impairments to interact with digital document content.
Examples of AT include:
Further information may be found on Wikipedia, among other sources.
Similar to HTML, PDF provides attributes that assign additional information to tags, extending what the tag alone describes.
An example of an attribute is “ColSpan” which specifies the number of columns spanned by a given table cell.
In PDF, a technical distinction is made between attributes and properties of an object.
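As a minimal sketch of how software might use the ColSpan attribute, the following Python models table cells as plain dictionaries and counts the grid columns a row occupies. The dictionary layout and the function name row_width are assumptions made for illustration, not part of any PDF library.

```python
# Hypothetical in-memory model: each cell is a dict holding a tag name and,
# optionally, a dict of attributes such as ColSpan.
def row_width(cells):
    """Return the number of grid columns occupied by one table row."""
    # A cell without a ColSpan attribute spans exactly one column.
    return sum(cell.get("attributes", {}).get("ColSpan", 1) for cell in cells)


header_row = [
    {"tag": "TH", "attributes": {"ColSpan": 2}},  # header spanning two columns
    {"tag": "TH"},
]
data_row = [{"tag": "TD"}, {"tag": "TD"}, {"tag": "TD"}]

print(row_width(header_row))  # 3
print(row_width(data_row))    # 3
```

Both rows occupy three columns, which is the kind of consistency a checker could verify once ColSpan is taken into account.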
The author’s intent is the set of choices made by the content creator in order to communicate with the reader. These include, for example:
Block level tags (formally defined in the PDF specification, ISO 32000, as block level structure elements, or BLSE) provide meaning and organization for the overall layout of content on the page. Common examples of block level tags include P or Hn tags. Some tags can be either at the Grouping, Block, or Inline level, but only one level at a time.
Content order is the sequence in which objects are painted onto the page. In PDF, the purpose of content ordering is to achieve a correct and consistent result when rendered to the screen or a printer. The content order may or may not correspond to the logical content order.
"Graphics object" is a term of art in PDF technology, as defined by PDF 2.0. The specification defines five types of graphics objects: path objects, text objects, external objects (often used for images), inline images, and shading objects. Graphics objects are part of a document’s content and are divided into two classes: real content and artifacts.
PDF provides the following five types of graphics objects (ISO 32000-2, clause 8.2):
- A path object is an arbitrary shape made up of straight lines, rectangles, and cubic Bézier curves […]
- A text object consists of one or more character strings that identify sequences of glyphs to be painted […]
- An external object (XObject) is an object defined outside the content stream and referenced as a named resource (see 7.8.3, "Resource Dictionaries") [...]
- An inline image object uses a special syntax to express the data for a small image directly within the content stream […]
- A shading object describes a geometric shape whose color is an arbitrary function of position within the shape […]
Grouping tags enclose and organize block level tags, but do not directly enclose content.
Heading tags (H or Hn) are the PDF equivalent of HTML's h1-h6 tags. PDF permits two types of heading tags to structure content - H (in which the tag tree itself defines the heading level) or Hn (which self-identifies the heading level). PDF/UA does not allow use of both mechanisms in one document.
Further information may be found in the Matterhorn Protocol.
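The rule that both heading mechanisms should not appear in one document can be sketched as a simple check over the tag names used. This is an illustrative sketch only: the function names are hypothetical, and the pattern accepts any positive heading number, which is an assumption rather than a statement about any particular PDF version.

```python
import re

# Matches numbered heading tags such as H1, H2 or H10 (assumed pattern).
NUMBERED_HEADING = re.compile(r"^H[1-9][0-9]*$")


def heading_styles(tag_names):
    """Return the set of heading mechanisms ("H" and/or "Hn") that are used."""
    styles = set()
    for name in tag_names:
        if name == "H":
            styles.add("H")
        elif NUMBERED_HEADING.match(name):
            styles.add("Hn")
    return styles


def mixes_heading_styles(tag_names):
    """True when both the H and the Hn mechanism occur in the same document."""
    return len(heading_styles(tag_names)) > 1


print(mixes_heading_styles(["H1", "P", "H2", "P"]))  # False
print(mixes_heading_styles(["H", "P", "H2"]))        # True
```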
Inline level tags (formally defined as inline level structure elements (ILSE)) provide meaning and organization for content within block level tags.
See text recognition.
When text is invisible the characters are completely transparent. Accordingly, when positioned over a background layer, the background will be visible through the characters. In the case of PDF, Optical Character Recognition technology often places invisible text over a scanned image to allow for text selection and extraction, including for purposes of accessibility.
The logical content order (as distinct from the content order) determines the order in which real content is processed by some software, especially assistive technology. It is the unique and linear arrangement of real content elements so that information can flow logically from one unit to another.
Although the logical content order is mostly based on the visual order of the elements on the page, it can be set completely independently of the arrangement of the elements within the page layout. Ultimately, the logical content order is based on the author's intent.
A particular logical content order may be correct or incorrect according to whether or not the order makes sense for understanding the document.
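The difference between content order and logical content order can be illustrated with a small sketch. The page model below is hypothetical: each painted item carries an identifier (called mcid here by assumption), and a flattened tag tree lists tags in reading order, each pointing at one identifier.

```python
# Hypothetical page: items listed in the order they are painted (content order).
painted = [
    {"mcid": None, "text": "Page footer"},   # an artifact, not referenced by any tag
    {"mcid": 1, "text": "Body paragraph"},
    {"mcid": 2, "text": "Chapter title"},
]

# Hypothetical flattened tag tree: (tag, mcid) pairs in logical content order.
tag_tree = [("H1", 2), ("P", 1)]


def logical_order(painted_items, tags):
    """Return (tag, text) pairs in the order given by the tag tree."""
    by_mcid = {
        item["mcid"]: item["text"]
        for item in painted_items
        if item["mcid"] is not None
    }
    return [(tag, by_mcid[mcid]) for tag, mcid in tags]


print([item["text"] for item in painted])
# ['Page footer', 'Body paragraph', 'Chapter title']
print(logical_order(painted, tag_tree))
# [('H1', 'Chapter title'), ('P', 'Body paragraph')]
```

The painted sequence renders correctly on screen, while only the second sequence reads sensibly when presented linearly.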
The main content of a document is the real content that makes up the primary flow of content. Examples that are usually not part of the main content include:
Content and the tag tree are encoded separately in PDF documents. A marked-content sequence identifies a segment of content for the purpose of connecting it to the document’s tags or to mark it as an artifact (and thus, exclude it from the tag tree).
Examples of content that can be enclosed in a marked-content sequence include (but are not limited to):
Each tag can be associated with one or more marked-content sequences. One of the requirements for PDF/UA-conforming documents is that all content is identified using marked-content sequences.
NOTE: PDF also supports “marked content points”, which are not utilized for accessibility purposes.
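To make the idea concrete, the sketch below scans a small, hand-written content stream fragment for marked-content sequences, which are delimited by the BDC or BMC operators and closed by EMC. It is deliberately simplified and rests on several assumptions: the fragment is uncompressed, sequences are not nested, and the letters EMC never occur inside a text string. The sample stream and the function name are illustrative only.

```python
import re

# Illustrative, uncompressed content stream fragment (not from a real file).
STREAM = """\
/P <</MCID 0>> BDC
BT /F1 12 Tf (Hello) Tj ET
EMC
/Artifact BMC
BT /F1 8 Tf (Page 1) Tj ET
EMC
"""

# A tag name, then either a property dictionary followed by BDC, or BMC alone.
SEQUENCE = re.compile(r"/(\w+)\s*(?:<<(.*?)>>\s*BDC|BMC)(.*?)EMC", re.DOTALL)
MCID = re.compile(r"/MCID\s+(\d+)")


def marked_content_sequences(stream):
    """Return (tag, mcid) for each sequence; mcid is None when absent."""
    results = []
    for match in SEQUENCE.finditer(stream):
        tag, props, _body = match.groups()
        mcid_match = MCID.search(props or "")
        mcid = int(mcid_match.group(1)) if mcid_match else None
        results.append((tag, mcid))
    return results


print(marked_content_sequences(STREAM))
# [('P', 0), ('Artifact', None)]
```

The first sequence carries an MCID that a tag can reference; the second is marked as an artifact and so stays outside the tag tree.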
Meaningful content is a synonym of real content.
Natural language (e.g., English, Spanish, or German) is a language spoken or written by humans. Documents following the PDF/UA standard must specify their natural language. At a minimum, the default language must be specified in the document metadata. If the document contains text in a language that differs from the default, that text must also be marked with the correct natural language.
Non-text content is any content that is not text content, for example pictures, lines or graphs.
See text recognition.
PDF for Universal Accessibility (PDF/UA), published by ISO as ISO 14289, defines the correct use of tags in PDF for accessibility purposes. PDF/UA-1 provides rules with respect to PDF 1.7 (ISO 32000-1) while PDF/UA-2 provides rules with respect to PDF 2.0 (ISO 32000-2). See the core PDF glossary.
An example of information that can be managed by properties is the language of the text in a specific marked content sequence.
Use of the Unicode “PUA” (Private Use Area) is an accessibility error, as it represents a case wherein the authoring application has referenced non-standard Unicode.
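A checker could flag such characters with a few lines of Python. This sketch relies on the standard library's unicodedata module, which reports the general category "Co" for private-use code points; the function name is hypothetical.

```python
import unicodedata


def private_use_characters(text):
    """Return (index, code point) for every Private Use Area character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" is the private-use category
    ]


print(private_use_characters("A\ue000B"))  # [(1, 'U+E000')]
print(private_use_characters("abc"))       # []
```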
Real content comprises all elements that are intended by the author to be relevant for understanding the content.
Real content is one of the two classes of content, the other being artifacts.
Example: A paragraph in the main body of the document is real content.
Although the PDF specification defines a set of standard tags to enable consistent processing of semantics, PDF creators are not limited to this base set and may extend it through the use of custom tags. Role mapping enables consistent interpretation by assistive technology by assigning such custom tags to semantically appropriate standard PDF tags.
See the Tagged PDF Best Practice Guide for additional information.
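Role mapping can be pictured as a lookup table from custom tag names to standard ones. The sketch below is an illustration under stated assumptions: the custom tag names are invented, the set of standard tags is abbreviated, and the resolver follows chained mappings while guarding against loops.

```python
# Abbreviated set of standard tags, for illustration only.
STANDARD_TAGS = {"P", "H1", "H2", "L", "LI", "Lbl", "LBody", "Table", "TR", "TH", "TD", "Span"}

# Hypothetical role map: the custom tag names are invented for this example.
ROLE_MAP = {"ChapterTitle": "H1", "BodyText": "P", "Intro": "BodyText"}


def resolve_role(tag, role_map, standard_tags):
    """Follow the role map until a standard tag is reached, else return None."""
    seen = set()
    while tag not in standard_tags:
        if tag in seen or tag not in role_map:
            return None  # unmapped custom tag, or a circular mapping
        seen.add(tag)
        tag = role_map[tag]
    return tag


print(resolve_role("ChapterTitle", ROLE_MAP, STANDARD_TAGS))  # H1
print(resolve_role("Intro", ROLE_MAP, STANDARD_TAGS))         # P
print(resolve_role("Sidebar", ROLE_MAP, STANDARD_TAGS))       # None
```

A result of None corresponds to a custom tag that assistive technology has no reliable way to interpret.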
“Semantically appropriate” tags accurately represent the meaning of the content they contain. In this context, the “meaning” includes the content’s type, its structure and the relationships between content elements otherwise communicated through visual appearance. Ensuring semantically appropriate tagging enables accurate, consistent communication of all aspects of the content to users in a non-visual way, including by assistive technology.
Clues to appropriate semantics include visual cues such as separation between paragraphs and the use of a larger, bold font to indicate a heading.
Some tags (“parent tags”) contain other tags (“child tags”). For example, the L tag contains one or more LI tags to make a list, and the LI tag in turn can contain an Lbl and LBody tag. These nested tags are known as “substructure”.
Of course, many other types of tags can contain substructures, for example, a Caption tag can contain P tags, or other tags, as substructure.
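The list example above can be written down as a nested structure. In this sketch each tag is modelled, by assumption, as a (name, children) tuple, and a small recursive function prints the parent and child relationships as an indented outline.

```python
# Hypothetical model of a two-item list: each node is (tag name, child nodes).
list_tag = (
    "L",
    [
        ("LI", [("Lbl", []), ("LBody", [])]),
        ("LI", [("Lbl", []), ("LBody", [])]),
    ],
)


def outline(node, depth=0):
    """Return one indented line per tag, parents before their children."""
    tag, children = node
    lines = ["  " * depth + tag]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(list_tag)))
```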
The tab order is the sequence in which a user can navigate between interactive elements in a PDF document, for example when using the Tab key. The tab order enables focus changes to the next interactive element (such as links or form fields) in a defined order. The tab order should follow the logical content order of the document.
PDF/UA requires that the tab order is set.
In order to understand the content of a table, it is necessary to understand the relationships between data. These relationships are established by the use of header cells (TH). It is possible to have row and/or column headers. For example, a column headed by a TH establishes the relationship of the data (tagged in TD tags) below it.
An important distinction: tables can have headers (TH) while documents can have headings (Hn).
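The relationship that a column header establishes can be sketched as follows. The example assumes a simple table whose first row consists only of TH cells and in which no cell spans more than one column; the data and the function name are illustrative.

```python
# Hypothetical simple table: rows of (tag, text) cells, first row all TH.
table = [
    [("TH", "Product"), ("TH", "Price")],
    [("TD", "Apple"), ("TD", "1.20")],
    [("TD", "Pear"), ("TD", "0.95")],
]


def with_column_headers(rows):
    """Pair every data cell with the text of the header above it."""
    header_row, *data_rows = rows
    headers = [text for _tag, text in header_row]
    return [
        [(headers[col], text) for col, (_tag, text) in enumerate(row)]
        for row in data_rows
    ]


for row in with_column_headers(table):
    print(row)
# [('Product', 'Apple'), ('Price', '1.20')]
# [('Product', 'Pear'), ('Price', '0.95')]
```

This pairing is what lets software announce "Price: 1.20" rather than a bare number.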
Similar to tags in HTML, tags in PDF (formally defined as “structure elements”) categorize the semantic intent of the content they enclose to enable appropriate representation within software, including assistive technology.
Examples of tags in PDF:
PDF allows for custom tags. Accessible PDF requires that custom tags are role mapped to PDF’s standard tags to ensure correct interpretation.
PDF/UA, a technical specification, uses “structure element”, which is the technical term for tag.
The tag tree represents a hierarchy (formally defined as “structure hierarchy”) of the complete real content of a document, and communicates to software the author’s intent for all the real content on the page.
TECHNICAL NOTE: The “tag tree” concept is discussed in the PDF specification, PDF 2.0, using the term logical content order. Earlier PDF specifications used the term logical structure order.
Text content means every character in the document, e.g., letters, digits, punctuation marks and special characters. To make text content machine-readable, it must include Unicode mappings.
Text recognition software allows retrieval of machine-readable text from an image. Popular text recognition technologies include Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR). The most common use case for text recognition in the PDF/UA context is an image intended to be consumed as text.
The accuracy of recognized text is critical to accessibility applications, and thus, users performing text recognition have an obligation to ensure that the final result is correct. The following image demonstrates how uncorrected text-recognition can impact the accessibility of content.
Unicode is a global information technology standard for the consistent identification of characters, e.g., letters, numbers, symbols and emojis. Correct Unicode mappings make characters machine-readable so that assistive technology can interpret them.
Unicode provides a unique number for every character independently from software or language that is used. Some examples:
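As an illustration, Python's standard unicodedata module can show the number (code point) and the official name assigned to a character. The helper name describe is an assumption made for this sketch.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00e9"))  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(describe("\u20ac"))  # U+20AC EURO SIGN
```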