In the summer of 2019 the PDF Association’s PDF/UA Technical Working Group released the Tagged PDF Best Practice Guide: Syntax to provide developers and expert users with formal advice and best practices for implementing tagged PDF. The guide is also useful to those generating tagged PDF to create accessible documents.
Since its publication the Guide’s recommendations for document titles have generated some questions. This article, authorized by the PDF/UA TWG, sets out to answer them.
NOTE: For readability I will use the common term “tag” to refer to PDF’s standard structure types.
The Tagged PDF Best Practice Guide: Syntax, in clause 4.2.2.2, states as follows:
"Since PDF/UA-1 does not require any specific structure type for title content, it is permissible to structure such content with either <H1> or other structure element types (typically, <P> or structure element types mapped to <P>).
Page content representing the title can - especially in publications - appear several times in the document. If <H1> structure elements are used to enclose such content, it is recommended that only one such instance of the title be structured as <H1>.
Since headings commonly appear in tables of contents, and since document titles do not normally appear in tables of contents, a future-proof (PDF 2.0) approach would be to use a <Title> structure type (which is defined in PDF 2.0, see Annex B) mapped to the <P> structure type."
So which is it? Should authors and remediators tag document titles with <H1> or with <P>, or maybe something else? And what are the consequences of one or the other choice?
The typical visual indicators denoting headings of various levels (font size, weight, color, etc.) help readers to quickly locate content of interest, especially in longer or highly structured content. In the PDF and hardcopy contexts, headings are often used to create a table of contents displaying the document’s structure in a hierarchical tree (on a separate page and/or using PDF’s bookmarks feature).
Users who depend on assistive technology (AT) likewise rely on headings to enable navigation by providing them with the page or document’s structure in an accessible form.
As it is fundamentally oriented towards web content rather than web-independent content, WCAG can be difficult to apply to PDF, so it's useful to begin by reviewing fundamental distinctions:
Accordingly, headings are relatively less significant for navigation in HTML because each HTML page tends to be short, with relatively few headings or deeply nested heading structures. PDF, on the other hand, must be able to accommodate documents of thousands of pages, or deeply nested subheadings.
WCAG is technology-neutral but generally deals with web content; it does not directly address offline documents, so applying its rules and techniques to PDF is less straightforward. Nonetheless, the specification is clear (success criterion 2.4.2, Level A) that documents (“pages” in WCAG’s vernacular) must have programmatically identifiable titles. WCAG’s support documentation identifies a PDF technique for this purpose that relies on the document’s metadata, which is a reasonable parallel for HTML’s <title> element in the <head>.
WCAG is silent, however, regarding the document title as expressed in the <body>. The HTML convention is to use <h1> for such content, possibly <h2> for subtitles, and then allocate the remaining heading levels up to <h6> to denote important content.
In HTML, document titles in the page’s <body> are typically marked <h1>. Subtitles may (or may not) be marked <h2>. HTML’s limit of 6 levels of headings is very rarely considered a significant constraint.
In PDF, documents may have no headings at all, or a thousand headings. If a document includes title content it may be repeated on several pages.
There’s simply no meaningful parallel between the most important (<h1>) content in the <body> of an HTML page and document title content on PDF page(s).
In HTML <h1> marks the main heading on each webpage irrespective of whether that page is actually a subsection of a larger body of content.
In PDF, <H1> marks the top-level logical sections of a larger document. There’s no express or implied relationship between <H1> and any specific URL or physical page.
In HTML the <title> concept is limited to the metadata in each webpage’s <head>, enabling coordination with each page’s respective <h1> content.
In PDF there's XMP Dublin Core metadata for the document’s title.
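To make the metadata side of this comparison concrete, here is a minimal Python sketch that reads the dc:title property from an XMP packet using only the standard library. The sample packet, its title text and the function name xmp_title are illustrative assumptions; a real packet would first have to be extracted from the PDF's metadata stream and would typically carry many more properties.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP for Dublin Core properties.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# ElementTree expands the built-in xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, heavily trimmed XMP packet (assumption for illustration).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Annual Report 2019</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def xmp_title(packet):
    """Return the dc:title text, preferring the x-default entry, or None."""
    root = ET.fromstring(packet)
    items = root.findall(".//dc:title/rdf:Alt/rdf:li", NS)
    for li in items:
        if li.get(XML_LANG) == "x-default":
            return (li.text or "").strip()
    # Fall back to the first language alternative, if there is one.
    return (items[0].text or "").strip() if items else None


print(xmp_title(SAMPLE_XMP))  # prints: Annual Report 2019
```

The point of the sketch is that the title is available programmatically from the metadata alone, without inspecting any tagged page content.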
As discussed above, heading tags already have a clear role in document navigation, and it’s not to mark the title! Nothing in WCAG 2.1 or PDF/UA requires - or even implies - that titles in PDF documents should be tagged as if they were headings.
Here are some other reasons why PDF authors and accessibility remediators should shun <H1> for document titles and use <Title> instead.
Headings allow AT users to move between important subjects in a document. If headings denoting sections of content within the page are confused with titles, which denote the identity of the document itself, the navigational value of headings is compromised.
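The effect on navigation can be sketched in a few lines of Python. This is a simplified model, not how any particular AT product works: a tagged document is reduced to a flat list of (tag, text) pairs, and heading_list is a hypothetical helper that collects what a "list headings" feature might present. The sample documents are invented for illustration.

```python
def heading_list(elements):
    """Return (level, text) for every H1..H6 element, in document order."""
    headings = []
    for tag, text in elements:
        if len(tag) == 2 and tag[0] == "H" and tag[1] in "123456":
            headings.append((int(tag[1]), text))
    return headings


# Hypothetical document with the title tagged as a heading.
title_as_h1 = [
    ("H1", "Annual Report 2019"),
    ("H1", "Introduction"),
    ("P", "Body text."),
    ("H1", "Results"),
]

# The same document with the title tagged as <Title>.
title_as_title = [
    ("Title", "Annual Report 2019"),
    ("H1", "Introduction"),
    ("P", "Body text."),
    ("H1", "Results"),
]

print(heading_list(title_as_h1))
# [(1, 'Annual Report 2019'), (1, 'Introduction'), (1, 'Results')]
print(heading_list(title_as_title))
# [(1, 'Introduction'), (1, 'Results')]
```

In the first list nothing distinguishes the document's title from its top-level sections; in the second, the heading list contains only the document's actual structure.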
AT users rely on headings for navigation within the document. If they need to read the document’s title they can access the metadata (which PDF/UA requires to be present) directly without losing their place, exactly as suggested in the formal support documentation for WCAG 2.1 success criterion 2.4.2.
HTML’s limit of 6 heading levels isn’t a problem on most webpages, which tend to be short and/or don’t include deeply nested headings. However, the limit is problematic in PDF, and not only because some PDF documents extend to 7 or 8 heading levels, or even more. If titles are tagged with <H1>, only five heading levels remain with which to structure the document. Further, if the author uses <H2> to enclose a subtitle (as many do), only four heading levels remain for organizing the document's content!
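The arithmetic above can be written out as a small Python sketch. The function and constant names are hypothetical, and the limit of six levels applies to the <H1> through <H6> tags discussed here.

```python
MAX_HEADING_LEVELS = 6  # <H1> through <H6>


def levels_left(title_as_h1=False, subtitle_as_h2=False,
                max_levels=MAX_HEADING_LEVELS):
    """Heading levels still available for the document's real structure."""
    used = int(title_as_h1) + int(subtitle_as_h2)
    return max_levels - used


print(levels_left())                                        # 6
print(levels_left(title_as_h1=True))                        # 5
print(levels_left(title_as_h1=True, subtitle_as_h2=True))   # 4
```

A document whose content needs five levels of nesting therefore still fits if only the title is tagged <H1>, but no longer fits once the subtitle also consumes <H2>.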
PDF 2.0 allows any number of heading levels, just one of many important enhancements to the next-generation PDF format published in 2017.
Document authoring tools use headings to build a table of contents. The resulting table of contents usually doesn’t include the document’s title, however, as the title is not part of the document’s structure.
This fact highlights one of the problems, referenced above, in using <H1> tags for titles; AT users have no way to distinguish between the title and the document’s top-level structural headings.
PDF 2.0, published in 2017, introduces the <Title> structure element to resolve the matter and provide a fully semantically appropriate tag for title content in PDF documents.
The best solution for titles is to tag them with a <Title> tag. For those following PDF/UA-1, the first ISO standard for accessible PDF (based on ISO 32000-1, PDF 1.7), this <Title> tag should be role-mapped to <P>. For those using PDF 2.0, role mapping is unnecessary.
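As a rough illustration of what role mapping means for a consumer, the following Python sketch resolves a tag name through a role map until it reaches a standard structure type. It is a simplified model under stated assumptions: STANDARD_TYPES is only a subset of the PDF 1.7 standard types, resolve is a hypothetical helper, and <BookTitle> is an invented custom tag.

```python
# Subset of the PDF 1.7 standard structure types, enough for this sketch.
STANDARD_TYPES = {
    "Document", "Part", "Sect", "Div", "P", "H",
    "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Table", "Figure", "Span",
}

# In PDF syntax this corresponds to an entry such as
#   /RoleMap << /Title /P >>
# in the structure tree root dictionary.
ROLE_MAP = {"Title": "P"}


def resolve(tag, role_map, standard=STANDARD_TYPES):
    """Follow the role map until a standard type is reached, else None."""
    seen = set()
    while tag not in standard:
        if tag in seen or tag not in role_map:
            return None  # circular mapping or unmapped custom tag
        seen.add(tag)
        tag = role_map[tag]
    return tag


print(resolve("Title", ROLE_MAP))  # P
print(resolve("Title", {}))        # None
# A custom tag may be mapped in more than one step (hypothetical example).
print(resolve("BookTitle", {"BookTitle": "Title", "Title": "P"}))  # P
```

With the mapping in place, a PDF 1.7 consumer that knows nothing about <Title> can still treat the content as a paragraph, while the tag name itself records the author's intent.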
If the document’s metadata clearly identifies the document (as PDF/UA-1 requires), the title content on the page may also safely be tagged with <P>, since it merely repeats information already available in the metadata.
Subtitles, if any, should simply be included within the <Title> tag.
Use <H1> to tag the top-most level of organization in the document such as “Introduction”, “Table of Contents”, “Chapter 1”, “Chapter 2” and so on.
Use <H2> and subsequent levels to tag the 2nd and subsequent heading levels below each <H1> in the document.
Klaas Posselt is a graduate engineer in printing and media technology who, following a number of lines of inquiry, eventually landed on the subject of universally accessible PDF documents. He trains, assists and supports clients as they implement and optimize publication processes and move towards new digital output channels including ebooks, accessible PDFs and web platforms. As a Co-Chairman of the PDF …