Glossary of accessibility terminology in PDF

This glossary was developed by the PDF Accessibility Liaison Working Group. This page remains under active development; more terms will be added as Techniques are released and based on feedback received. Questions or comments should be sent to pdf-accessibility-lwg@pdfa.org.


ActualText

ActualText replaces content which is intended to be consumed as text but for which the correct text is not available.

ActualText is used, for example:

for non-text graphics objects representing text
for correcting OCR errors
for text without correct Unicode mapping.

When using ActualText on a tag, the type of tag (e.g., Figure, H2, etc.) may not be announced by Assistive Technology.

WARNING: As a consequence, for files using PDF 1.7 or earlier, if replacing the tag’s semantic meaning with the ActualText is not intended, the best practice is to place the ActualText on a Span tag or on a Span marked content sequence. This ensures that necessary tags (e.g., TD or Lbl) remain available to AT. This is different from HTML, in which the ActualText concept does not exist.

In the PDF 2.0 context, ActualText is simply additional information and does not affect the semantics of the tag on which it is set. However, assistive technology can nonetheless choose to ignore the tag and simply represent the ActualText in its place.
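
As a rough illustration of why placing ActualText on a Span preserves the surrounding tag, the following Python sketch models a text extractor that prefers ActualText over the underlying content. The `Node` class and the `extract_text` function are hypothetical names invented for this example; they do not belong to any PDF library, and real assistive technology may behave differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a tag or marked content sequence."""
    tag: str
    text: str = ""                      # text as encoded in the page content
    actual_text: Optional[str] = None   # ActualText replacement, if any
    children: List["Node"] = field(default_factory=list)


def extract_text(node: Node) -> str:
    """Return the text of a node; ActualText replaces everything beneath it."""
    if node.actual_text is not None:
        return node.actual_text
    return node.text + "".join(extract_text(child) for child in node.children)


# Assumed scenario: OCR produced "rnodern". ActualText on a Span supplies the
# correct word, while the enclosing P tag itself carries no ActualText.
paragraph = Node("P", children=[
    Node("Span", text="A "),
    Node("Span", text="rnodern", actual_text="modern"),
    Node("Span", text=" approach"),
])
print(extract_text(paragraph))  # prints: A modern approach
```

Because the replacement sits on the Span rather than on the P tag, the paragraph tag in this model remains untouched and can still be reported to the user.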

Alt text

Alt text (short for “alternative text”) is a property of a tag that provides a short textual description of real content. It is typically applied to non-text content.

For example, Alt text is applied to a Figure tag to provide an accessible alternative to an image.

Alternative / alternate description

PDF/UA requires an alternative description for Link annotations. The text describes the target web page or file. In many cases, the title of the web page or file is a good alternate description.

NOTE: Both “alternative” and “alternate” are commonly used terms; both refer (in PDF) to the value of a tag’s “Alt” property.

Annotation

Annotations are a commonly used feature of PDF that provides users with the ability to add material to PDF pages. Annotations exist separately from page content but, for accessibility purposes, are considered content.

The PDF specification defines different types of annotations covering many uses, for example, text markup, notes, links, watermarks, forms, 3D content, redaction markup and rich media.

Artifact

An artifact is a decoration, pagination, or layout-related element in a PDF that is not intended by the author to be relevant for understanding the content, and is thus typically ignored by AT.

Content is marked as an artifact to enable software to distinguish it from real content. Examples of artifacts include, but are not limited to:

decorative lines,
grid lines of a table,
background images,
page numbers,
running headers and footers.

Assistive technology

Assistive technology (AT) refers to any means, tool or technical solution which allows users with disabilities or impairments to interact with digital document content.

Examples of AT include:

Screen reader
Screen magnifier
Braille display
Speech recognition for commands to navigate in a document

Further information may be found on Wikipedia and WebAIM, among other sources.

Attribute

Similar to HTML, attributes assign additional information to tags, providing an extended description of them.

An example of an attribute is “ColSpan” which specifies the number of columns spanned by a given table cell.

In PDF, a technical distinction is made between attributes and properties of an object.

Author’s intent

The author’s intent is the set of choices made by a content creator in order to communicate with the reader. These include, for example:

Organization and ordering of the real content (structure).
Content (words and images).
Layout and appearance on the page, including pagination and other artifacts.

To ensure their meaning is conveyed to all users, regardless of disability, the author is fundamentally responsible for ensuring that headings are used in a logical manner, that the use of colour and contrast does not render the content inaccessible, and that images include alternative text.

Document remediators (that is, users who adjust PDF files to enhance or assure accessibility) are also considered “authors” to the extent that they may make various changes to the document, including (but not limited to):

Changing the ordering of tags to reflect the document’s organization.
Changing tag type to reflect the content’s semantics.
Adding or changing alternative text for images.
Adding or changing links.
Adding or changing associated files providing accessible alternatives.

Block level tag

Block level tags (formally defined in the PDF specification, ISO 32000, as block level structure elements (BLSE)) provide meaning and organization for the overall layout of content on the page. Common examples of block level tags include the P and Figure tags. Some tags can be used at the Grouping, Block, or Inline level, but only at one level at a time.

Caption tag

Content that captions (describes) other content is enclosed within a Caption tag.

Container

Informal synonym for marked content sequence.

Content

Content is visible on PDF pages and can include, for example, graphics objects, annotations, or form fields. All content is classified either as real content or as artifacts.

Content order

The sequence in which objects are painted onto the page. In PDF, the purpose of content ordering is to achieve a correct and consistent result when rendered to the screen or a printer. The content order may or may not correspond to the logical content order.

Fundamental Technique

A Fundamental Technique describes a procedure for achieving a PDF/UA requirement that applies to every PDF/UA file, in contrast to Specific Techniques.

Graphics object

"Graphics object" is a term of art in PDF technology, as defined by PDF 2.0. The specification defines five types of graphics objects: path objects, text objects, external objects (often used for images), inline images, and shading objects. Graphics objects are part of a document’s content and are divided into two classes: real content and artifacts.

PDF provides the following five types of graphics objects:

A path object is an arbitrary shape made up of straight lines, rectangles, and cubic Bézier curves […]

A text object consists of one or more character strings that identify sequences of glyphs to be painted […]

An external object (XObject) is an object defined outside the content stream and referenced as a named resource (see 7.8.3, "Resource Dictionaries") [...]

An inline image object uses a special syntax to express the data for a small image directly within the content stream […]

A shading object describes a geometric shape whose color is an arbitrary function of position within the shape […]

ISO 32000-2, clause 8.2

Grouping tag

Grouping tags enclose and organize block level tags, but do not directly enclose content.

Heading tag

Heading tags (H or Hn) are the PDF equivalent of HTML's h1-h6 tags. PDF permits two types of heading tags to structure content:

H (in which the tag tree itself defines the heading level), or
Hn (which self-identifies the heading level).

PDF/UA-1 does not allow use of both mechanisms in one document. Further information may be found in the Matterhorn Protocol.
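
The rule against mixing the two mechanisms can be sketched as a simple check over the tag names found in a document. This is a minimal illustration only: the function name `mixes_heading_styles` is hypothetical, the list of tag names is assumed to have been collected by some other tool, and treating any "H" followed by a positive integer as an Hn tag is an assumption of this example.

```python
import re
from typing import Iterable

# Assumption for this sketch: an Hn tag is "H" followed by a positive integer.
HN_PATTERN = re.compile(r"^H[1-9][0-9]*$")


def mixes_heading_styles(tag_names: Iterable[str]) -> bool:
    """Return True if both H and Hn heading tags occur in the same document."""
    names = list(tag_names)
    uses_h = any(name == "H" for name in names)
    uses_hn = any(HN_PATTERN.match(name) is not None for name in names)
    return uses_h and uses_hn


print(mixes_heading_styles(["H1", "P", "H2", "P"]))  # prints: False
print(mixes_heading_styles(["H", "P", "H2", "P"]))   # prints: True
```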

Illustrative content

Illustrative content is real content that is also non-text content.

Inline tag

Inline level tags (formally defined as inline level structure elements (ILSE)) provide meaning and organization for content within block level tags.

Intelligent Character Recognition (ICR)

See text recognition.

Invisible text

When text is invisible the characters are completely transparent. Accordingly, when positioned over a background layer, the background will be visible through the characters. In the case of PDF, Optical Character Recognition technology often places invisible text over a scanned image to allow for text selection and extraction, including for purposes of accessibility.

ListNumbering attribute

The ListNumbering attribute is set on the L tag of a list to indicate the types of labels the list uses. It is required for ordered lists and optional for unordered lists.

The possible values of the ListNumbering attribute are defined in ISO 32000-2, table 382.

Logical content order

The logical content order (as distinct from the content order) determines the order in which real content is processed by some software, especially assistive technology. It is the unique and linear arrangement of real content elements so that information can flow logically from one unit to another following the author's intent.

Although the logical content order is mostly based on the visual order of the elements on the page, it can be set completely independently of the arrangement of the elements within the page layout. Ultimately, the logical content order is determined by the author's intent.

PDF's logical content order is set by the order and hierarchy of the tags in the tag tree, and then by the order of the Marked Content Sequences within a tag.

A particular logical content order may be correct or incorrect according to whether or not the order makes sense for understanding the document.
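
To make the relationship between the tag tree and the logical content order more concrete, the sketch below walks a simplified tag tree depth-first and lists marked content sequences in the order a consumer would encounter them. The `Tag` class and `logical_order` function are hypothetical, and representing each marked content sequence by a bare integer identifier is a simplifying assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Tag:
    """Hypothetical tag: its kids are child tags or marked content identifiers."""
    name: str
    kids: List[Union["Tag", int]] = field(default_factory=list)


def logical_order(tag: Tag) -> Iterator[int]:
    """Yield marked content identifiers in tag tree order (depth-first)."""
    for kid in tag.kids:
        if isinstance(kid, Tag):
            yield from logical_order(kid)
        else:
            yield kid


# Assumed page: content was painted in the order 0, 1, 2, 3, but the author
# intends the heading (2) to be read first.
document = Tag("Document", [
    Tag("H1", [2]),
    Tag("P", [0, 3]),
    Tag("P", [1]),
])
print(list(logical_order(document)))  # prints: [2, 0, 3, 1]
```

The painted order (0, 1, 2, 3) and the resulting logical order (2, 0, 3, 1) differ, which is exactly the independence from content order described above.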

Main content

The main content of a document is the real content that makes up the primary flow of content. Examples that are usually not part of the main content include:

Footnotes / endnotes
Sidebars
Advertisements
Line numbers

Marked-content sequence

Content and the tag tree are encoded separately in PDF documents. A marked-content sequence identifies a segment of content for the purpose of connecting it to the document’s tags or to mark it as an artifact (and thus, exclude it from the tag tree).

Examples of content that can be enclosed in a marked-content sequence include (but are not limited to):

A single character
A few characters in a word
A line of text
An image
Table cell borders (marked as Artifact)
Rule line separating content from footnotes (marked as Artifact)

Each tag can be associated with one or more marked-content sequences. One of the requirements for PDF/UA-conforming documents is that all content is identified using marked-content sequences.

NOTE 1: Marked content sequences are sometimes called containers.
NOTE 2: PDF also supports “marked content points”, which are not utilized for accessibility purposes.
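
For readers curious about how marked-content sequences appear in a page's content stream, the following sketch assembles two illustrative fragments as plain strings: one connected to a tag through a marked-content identifier (MCID), and one marked as an artifact. The helper names are hypothetical, the font resource name F1 and the coordinates are assumptions, and the sketch does not escape special characters in strings or produce a complete PDF file.

```python
def tagged_sequence(tag: str, mcid: int, operators: str) -> str:
    """Wrap content operators in a marked-content sequence carrying an MCID."""
    return f"/{tag} <</MCID {mcid}>> BDC\n{operators}\nEMC"


def artifact_sequence(operators: str) -> str:
    """Wrap content operators in a marked-content sequence marked as Artifact."""
    return f"/Artifact BMC\n{operators}\nEMC"


# Assumed content: a short line of text and a decorative rule line.
text_ops = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
rule_ops = "72 700 m 540 700 l S"

print(tagged_sequence("P", 0, text_ops))
print(artifact_sequence(rule_ops))
```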

Meaningful content

Meaningful content is a synonym of real content.

Natural language

A natural language (e.g., English, Spanish or German) is a human language. Natural language must be specified for documents following PDF/UA. A default language must be specified by defining a natural language in the document metadata. If the document contains text in a language that differs from the default language, such text must be marked with the correct natural language.

Non-text content

Non-text content is any content that is not text content, for example pictures, lines or graphs.

Optical Character Recognition (OCR)

See text recognition.

PDF/UA

PDF for Universal Accessibility (PDF/UA), published by ISO as ISO 14289, defines the correct use of tags in PDF for accessibility purposes. PDF/UA-1 provides rules with respect to PDF 1.7 (ISO 32000-1) while PDF/UA-2 provides rules with respect to PDF 2.0 (ISO 32000-2). See the core PDF glossary.

Properties

Properties (formally “property lists”) are additional information on marked content sequences. In contrast, attributes apply only to structure elements (tags).

An example of information that can be managed by properties is the language of the text in a specific marked content sequence.

PUA

Use of the Unicode Private Use Area (PUA) is an accessibility error, as it represents a case wherein the authoring application has referenced non-standard Unicode.
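
As a minimal sketch, extracted text can be scanned for Private Use Area characters with Python's standard `unicodedata` module, which reports such characters under the general category "Co". The function name `private_use_characters` is hypothetical, and the sample string is an assumption made for illustration.

```python
import unicodedata
from typing import List


def private_use_characters(text: str) -> List[str]:
    """List the code points in text that fall in a Unicode Private Use Area."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Co"]


# Assumed sample: a bullet mapped to a private use code point instead of a
# standard Unicode character.
sample = "\uf0b7 First item"
print(private_use_characters(sample))  # prints: ['U+F0B7']
```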

Real content

Real content comprises all elements that are intended by the author to be relevant for understanding the content.

Real content is one of the two classes of content, the other being artifacts.

Example: A paragraph in the main body of the document is real content.

Regular table

A regular table is a table that has the same number of cells in each row after accounting for row and column spans.
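
One way to read this definition is as a grid-filling check: place each cell into a grid according to its spans, then confirm that every row covers the same columns without gaps or overlaps. The sketch below follows that interpretation. The function name `is_regular` and the representation of each cell as a (row span, column span) pair are assumptions of this example, not part of the PDF specification.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row_span, col_span), an assumed representation


def is_regular(rows: List[List[Cell]]) -> bool:
    """Return True if all rows cover the same columns once spans are applied."""
    occupied: Dict[int, Set[int]] = {}  # row index -> filled column indices
    for r, row in enumerate(rows):
        filled = occupied.setdefault(r, set())
        col = 0
        for row_span, col_span in row:
            # Skip columns already taken by cells spanning down from above.
            while col in filled:
                col += 1
            for dr in range(row_span):
                target = occupied.setdefault(r + dr, set())
                for dc in range(col_span):
                    if col + dc in target:
                        return False  # two cells overlap
                    target.add(col + dc)
            col += col_span
    if len(occupied) != len(rows):
        return False  # a row span runs past the last row
    widths = set()
    for cols in occupied.values():
        if cols != set(range(len(cols))):
            return False  # a row has a gap
        widths.add(len(cols))
    return len(widths) <= 1


# The first row has a cell spanning two columns; both rows cover three columns.
print(is_regular([[(1, 2), (1, 1)], [(1, 1), (1, 1), (1, 1)]]))  # prints: True
# The second row is one cell short.
print(is_regular([[(1, 1), (1, 1)], [(1, 1)]]))                  # prints: False
```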

Role mapping

Although the PDF specification defines a set of standard tags to enable consistent processing of semantics, PDF creators are not limited to this base set and may extend it through the use of custom tags. Role mapping enables consistent interpretation by assistive technology by assigning such custom tags to semantically appropriate standard PDF tags.

See the Tagged PDF Best Practice Guide for additional information.
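
The idea behind role mapping can be sketched as a lookup that follows a custom tag through the role map until a standard tag is reached. This is an illustration under stated assumptions: `resolve_role` is a hypothetical function, the role map is modelled as a plain dictionary, the custom tag names are invented, and `STANDARD_TAGS` lists only a small subset of the standard tags.

```python
from typing import Dict, Optional

# Small illustrative subset of the standard tags, not the complete set.
STANDARD_TAGS = {
    "Document", "P", "H1", "H2", "L", "LI", "Lbl", "LBody",
    "Figure", "Table", "TR", "TH", "TD", "Span",
}


def resolve_role(tag: str, role_map: Dict[str, str]) -> Optional[str]:
    """Follow the role map until a standard tag is found; None if unresolved."""
    seen = set()
    while tag not in STANDARD_TAGS:
        if tag in seen or tag not in role_map:
            return None  # circular mapping, or no mapping for a custom tag
        seen.add(tag)
        tag = role_map[tag]
    return tag


# Hypothetical custom tags, as an authoring tool might define them.
role_map = {"ChapterTitle": "Heading1", "Heading1": "H1", "BodyText": "P"}
print(resolve_role("ChapterTitle", role_map))  # prints: H1
print(resolve_role("Sidebar", role_map))       # prints: None
```

A result of `None` in this sketch corresponds to a custom tag that assistive technology would have no standard interpretation for.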

Semantic

See real content.

Semantically appropriate

“Semantically appropriate” tags accurately represent the meaning of the content they contain. In this context, the “meaning” includes the content’s type, its structure and the relationships between content elements otherwise communicated through visual appearance. Ensuring semantically appropriate tagging enables accurate, consistent communication of all aspects of the content to users in a non-visual way, including by assistive technology.

Clues to appropriate semantics include visual cues such as separation between paragraphs and the use of a larger, bold font to indicate a heading.

Specific Technique

A Specific Technique describes a procedure on how to achieve a PDF/UA requirement that applies only to certain use cases or document elements, in contrast to Fundamental Techniques.

Structured, strongly

So-called “strongly structured” documents use the structure hierarchy to determine the level of headings. This method is not recommended for several reasons, and is no longer permitted in documents conforming to PDF/UA-2.

Structured, weakly

So-called “weakly structured” documents exclusively use the heading level of the tag (H1, H2, etc.) to determine the level of the heading. This is the only means of identifying headings permitted by PDF/UA-2.

Structure element

The formal term for tag as defined in the PDF specification, ISO 32000. Structure elements occur within the structure hierarchy (tag tree).

Structure hierarchy

The formal term for the organization of structure elements (tags) as defined in ISO 32000.

Substructure

Some tags (“parent tags”) contain other tags (“child tags”). For example, the L tag contains one or more LI tags to make a list, and the LI tag in turn can contain an Lbl and LBody tag. These nested tags are known as “substructure”.

Many other types of tags can also contain substructure. For example, a Caption tag can contain P tags, or other tags, as substructure.

Tab order

The tab order is the sequence in which a user can navigate between interactive elements in a PDF document, for example when using the Tab key. The tab order enables focus changes to the next interactive element (such as links or form fields) in a defined order. The tab order should follow the logical content order of the document.

PDF/UA requires that the tab order is set.

Table headers

In order to understand the content of a table, it is necessary to understand the relationships between data. These relationships are established by the use of header cells (TH). It is possible to have row and/or column headers. For example, a column headed by a TH establishes the relationship of the data (tagged in TD tags) below it.

An important distinction: tables can have headers (TH) while documents can have headings (Hn).

Tag

Similar to tags in HTML, in PDF, tags (formally defined as “structure elements”) categorize the semantic intent of the content they enclose to enable appropriate representation of that content within software, including assistive technology.

Examples of tags in PDF:

The P tag indicates a paragraph
The H1 tag indicates heading level one
The L tag indicates a list

PDF allows for custom tags. Accessible PDF requires that custom tags are role mapped to PDF’s standard tags to ensure correct interpretation.

PDF/UA, a technical specification, uses “structure element”, which is the technical term for tag.

Tag set

An informal term for the set of tags within a given standard structure namespace. The PDF 1.7 tag set is the default standard structure namespace in PDF 2.0.

Tag tree

The tag tree represents a hierarchy (formally defined as “structure hierarchy”) of the complete real content of a document, and communicates to software the author’s intent for all the real content on the page.

The tag tree specifies the logical content order used by assistive technology.

TECHNICAL NOTE: The “tag tree” concept is discussed in the PDF specification, PDF 2.0, using the term logical content order. Earlier PDF specifications used the term logical structure order.

Techniques for accessible PDF

A technique for accessible PDF describes a procedure for determining how a certain accessibility requirement can be fulfilled in the PDF format. The PDF Association’s PDF Accessibility Liaison Working Group (LWG) provides Fundamental Techniques and Specific Techniques.

Text content

Text content means every character in the document, e.g. letters, digits, punctuation marks and special characters. To make text content machine readable, it must include Unicode mappings.

Text recognition

Text recognition software allows retrieval of machine readable text from an image. Popular text recognition technologies include Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR). The most common use case for text recognition in the PDF/UA context is an image intended to be consumed as text.

The accuracy of recognized text is critical to accessibility applications, and thus, users performing text recognition have an obligation to ensure that the final result is correct. The following image demonstrates how uncorrected text-recognition can impact the accessibility of content.

Title (metadata)

The metadata title is used by applications to identify the document in various contexts, from file browsers to window labels, and is required for PDF/UA conformance. Such information is usually related to any title information provided in page content, but it does not need to match it.

Title (on-page)

The title is the main name of the document as it appears on the page and when printed. The title helps readers understand what the document is about. It is usually placed on the first page and formatted prominently (e.g., larger font, bold, or centered). This “on-page” title is not to be confused with the document’s metadata.

Unicode

Unicode is a global information technology standard for the consistent identification of characters, e.g. letters, numbers, symbols and emojis. Correct Unicode mappings allow characters to be machine-readable such that assistive technology can interpret them.

Unicode provides a unique number for every character independently from software or language. Some examples:

Latin Capital Letter “A” is mapped to U+0041
Latin Small Letter “a” is mapped to U+0061
Commercial “@” (at sign) is mapped to U+0040
The accented vowel “ü” is mapped to U+00FC
The ligature “œ” is mapped to U+0153
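
The mappings listed above can be reproduced with Python's standard library, as in the short sketch below. The helper name `code_point` is hypothetical; the characters are looked up by their official Unicode names so that the example does not depend on how this page itself is encoded.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format a single character's Unicode code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


names = [
    "LATIN CAPITAL LETTER A",
    "LATIN SMALL LETTER A",
    "COMMERCIAL AT",
    "LATIN SMALL LETTER U WITH DIAERESIS",
    "LATIN SMALL LIGATURE OE",
]
for name in names:
    ch = unicodedata.lookup(name)  # character for the given Unicode name
    print(code_point(ch), name)
```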

Use Case

In the PDF Association's Techniques for Accessible PDF, "use case" refers to a specific aspect of PDF with relevance to accessibility. Provided for each use case are: test criteria, pass and fail examples, as well as possible techniques and/or methods.
