How to get started with PDF 2.0

PDF is a large, complex specification! Developers often look for first steps in its implementation. This article offers some suggestions.
About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise. A developer and researcher working on PDF technologies for more than …
Peter Wyatt
August 18, 2023




PDF 2.0 (available at no cost from the PDF Association) introduced several new features and technical capabilities which might appear daunting at first read. 

When approaching a new version of a large specification, many developers and product managers look for some first steps, whether adding support for a new key or value in an existing dictionary, updating a library dependency for up-to-date support of a dependent technology, or implementing a feature introduced in PDF 2.0. 

This article supports those considering how to approach adding PDF 2.0 support to existing PDF 1.7 technology.

Get started with some quick wins!

PDF has never required implementation of every feature of every PDF version. PDF’s specification, together with a common-sense understanding of what interoperability really means, requires only that if a specific feature is implemented, it is implemented correctly, since this is how interoperability is achieved:

“Each PDF processor chooses which subsets of PDF functionality to support. For each subset that the processor chooses to support, the processor shall comply with the applicable provisions in this document.”

[ISO 32000-2:2020, clause 6.3.2 "Conformance of PDF processors"]

For this reason, together with real-world software development schedules, each PDF software publisher can focus on introducing and supporting features that are relevant to their marketplace as they see fit. More complex features always need more time for implementation - even today some choose not to implement common features introduced as far back as PDF 1.4 in 2001! 

The take-away: no-one expects or requires that every PDF 2.0 feature be implemented in a single “big bang”.

The PDF specification purposely avoids prescribing user interfaces or the presentation of functionality to users; these choices are left to PDF software publishers. As a result, the intended use cases and expressions of certain features may not be immediately obvious. This article proposes some ideas for supporting some of the “low hanging fruit” introduced with PDF 2.0.

Start with developer extensions 

Although initially introduced with ISO 32000-1 back in 2008, supporting developer extensions dictionaries is an essential first step to supporting PDF 2.0. This is because several new PDF 2.0 features (published as separate ISO Technical Specification (ISO/TS) documents) declare their presence in PDF files through developer extensions dictionaries [ISO 32000-2, 7.12 Extensions dictionary].

PDF 2.0 extended support to include arrays of developer extension dictionaries, allowing multiple extensions from the same developer to co-exist in a single PDF 2.0 file. The following example uses the officially registered developer prefix for ISO-standardized extensions (ISO_):

/Extensions <<
  /ISO_ [
    << /Type /DeveloperExtensions
       /BaseVersion /2.0
       /ExtensionLevel 32003
       /URL (https://www.iso.org/standard/45876.html)
       /ExtensionRevision (2023)
    >>
    << /Type /DeveloperExtensions
       /BaseVersion /2.0
       /ExtensionLevel 24064
       /URL (https://www.iso.org/standard/77686.html)
       /ExtensionRevision (2023)
    >>
  ]
  /ADBE << … >> % A vendor-specific extension
>>

Using developer extensions implies that when reporting a PDF’s “version” to an end user, as in a conventional File | Properties… dialog, the additional extensions supported by the PDF file also deserve recognition. This approach makes it easy for end users to know, for example, when they have a PDF 2.0 file that includes both the ISO extension for AES-GCM (ISO/TS 32003:2023) and the ISO extension for STEP 3D data (ISO/TS 24064:2023), as shown in the example above. Of course vendors may also declare their own extensions using their own registered prefix, as has been possible since PDF 1.7 and ISO 32000-1:2008.
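As a rough illustration of how a File | Properties… dialog might build such a version label, here is a minimal Python sketch. It assumes the Extensions dictionary has already been parsed into plain Python dicts and lists by some PDF library; the function name describe_version and the label format are hypothetical, not taken from any specification or product.

```python
def describe_version(base_version, extensions):
    """Build a human-readable version label from a parsed Extensions dictionary.

    `extensions` maps a developer prefix (e.g. "ISO_") to either a single
    developer extensions dict (the PDF 1.7 form) or a list of them (the
    array form added in PDF 2.0). Keys are shown without the leading "/".
    """
    labels = []
    for prefix, value in extensions.items():
        if prefix == "Type":  # optional /Type entry of the Extensions dictionary itself
            continue
        # Normalise the single-dictionary form to a one-element list.
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            label = f"{prefix} level {entry.get('ExtensionLevel')}"
            revision = entry.get("ExtensionRevision")
            if revision:
                label += f" ({revision})"
            labels.append(label)
    if not labels:
        return f"PDF {base_version}"
    return f"PDF {base_version} + " + ", ".join(labels)


# Example input mirroring the two ISO_ entries above; the ADBE values are
# illustrative only, since the article elides that dictionary.
example = {
    "ISO_": [
        {"BaseVersion": "2.0", "ExtensionLevel": 32003, "ExtensionRevision": "2023"},
        {"BaseVersion": "2.0", "ExtensionLevel": 24064, "ExtensionRevision": "2023"},
    ],
    "ADBE": {"BaseVersion": "1.7", "ExtensionLevel": 8},
}
print(describe_version("2.0", example))
```

A real dialog would probably map known extension levels to friendly names (for instance, 32003 to "AES-GCM"), but that lookup table is left out here.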

UTF-8 strings

BUSINESS NOTE

It’s easy to get started with PDF 2.0! A few small features can add real value for end users while showing your brand and technology leadership.

PDF 2.0 extended support of Unicode-based PDF text strings to include UTF-8, the lingua franca of the web and modern computing platforms. As DARPA’s SafeDocs program discovered, many PDF implementations already support UTF-8 (whether intended or not!) due to inherent support inside programming languages and operating systems!

Screenshot: a PDF bookmark using UTF-8 encoding, as shown by an implementation that doesn't support UTF-8.
Screenshot: a PDF bookmark using UTF-8 encoding, as shown by an implementation that supports UTF-8.

Accordingly, PDF files that use UTF-8 text strings for bookmarks, attachment file names, optional content layer names, document information keys, alternative text and so on may already work. The PDF Association provides a sample PDF 2.0 file with UTF-8 strings in common user interface elements to help you get started.
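For implementations that decode text strings themselves, the check can be very small. The Python sketch below selects a decoder from the byte order marker at the start of the string. The function name decode_text_string is hypothetical, and Latin-1 is used only as a stand-in for PDFDocEncoding (Python has no built-in PDFDocEncoding codec, and the two differ for some byte values), so treat the fallback branch as an assumption rather than a conforming decoder.

```python
UTF16BE_BOM = b"\xfe\xff"
UTF8_BOM = b"\xef\xbb\xbf"


def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string into a Python str."""
    if raw.startswith(UTF16BE_BOM):
        # UTF-16BE text strings have been supported since before PDF 2.0.
        return raw[len(UTF16BE_BOM):].decode("utf-16-be", errors="replace")
    if raw.startswith(UTF8_BOM):
        # UTF-8 text strings are new in PDF 2.0.
        return raw[len(UTF8_BOM):].decode("utf-8", errors="replace")
    # Assumption: Latin-1 approximates PDFDocEncoding for this sketch only.
    return raw.decode("latin-1")


print(decode_text_string(b"\xef\xbb\xbfCaf\xc3\xa9"))
```

Without the UTF-8 branch, the same bytes would fall through to the single-byte fallback and the three marker bytes would be displayed as stray accented characters in front of the text.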

Dependent libraries

The latest PDF 2.0 specification changed several normative references, from dated (and therefore fixed) versions, to undated versions, allowing PDF to keep up with corrections, updates and evolution in other technologies. For example:

  • PDF’s character collections (such as Adobe-KR-9 and Adobe-CNS1-7) now refer to the GitHub repository in which they are maintained. A benefit of GitHub is that it allows developers to look at the history of all versions to more deeply understand changes;
  • the list of 2-letter country codes used in Unicode text strings is now an undated reference to the latest ISO 3166 standard rather than an outdated set from 2006, as was specified in PDF 1.7;
  • references to the old (2003) Unicode ISO 10646 standard were changed to an undated reference, allowing PDF 2.0 to use newer Unicode versions in passwords, CMaps, etc. As described in the Unicode Consortium’s FAQ webpage on the relationship between ISO 10646 and Unicode, “... the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications” which is obviously important for PDF. 

By checking the normative references and appropriately updating the libraries or third party modules, PDF 1.7 technology can be updated to meet PDF 2.0’s requirements.

Associated Files

Associated Files, introduced to PDF with PDF/A-3 in 2012, were subsequently adopted into the core PDF 2.0 specification. The Associated Files feature allows embedded file streams to include a semantic relationship, as defined by the AFRelationship entry, with any PDF object. 

This feature opens new possibilities for PDF to act as a container format, or “digital envelope”, facilitating new open data workflows and experiences with PDF 2.0 (and PDF/A-3!) files. Examples could include (but aren’t limited to):

  • PDFs with calendars such as an event or convention timetable could associate vCal or iCal files with each timeslot to facilitate integration with calendar systems;
  • Specialized XML data could be associated with chemical equations (such as CML), music notation (such as MusicXML), or any of the hundreds of other formats supported by specific industries;
  • Software code could be associated with examples in textbooks or technical documentation;
  • Tables of figures, plots and charts could be associated with spreadsheets, CSV, JSON or YAML data files.

Table 43 in PDF 2.0 lists the set of pre-defined semantic relationships between a PDF object and the associated file, and includes some examples, as follows:

| AFRelationship value | Relationship between PDF object and the associated file |
| --- | --- |
| Source | shall be used if this file specification is the original source material for the associated content. |
| Data | shall be used if this file specification represents information used to derive a visual presentation – such as for a table or a graph. |
| Alternative | shall be used if this file specification is an alternative representation of content, for example audio. |
| Supplement | shall be used if this file specification represents a supplemental representation of the original source or data that may be more easily consumable (e.g., a MathML version of an equation). |
| EncryptedPayload | shall be used if this file specification is an encrypted payload document that should be displayed to the user if the PDF processor has the cryptographic filter needed to decrypt the document. |
| FormData | shall be used if this file specification is the data associated with the AcroForm (see 12.7.3, "Interactive form dictionary") of this PDF. |
| Schema | shall be used if this file specification is a schema definition for the associated object (e.g. an XML schema associated with a metadata stream). |
| Unspecified | shall be used when the relationship is not known or cannot be described using one of the other values. |


To provide initial support for the Associated Files feature, a viewer might present the AFRelationship value (along with the associated attachment’s filename, file size, etc.) in the “Embedded Files” or “File Attachments” panel found in many PDF viewers, as a means of exposing this semantic relationship to end users at a basic level. If nothing else, the presence of a value in a “Relationship” column might indicate an associated file rather than a different kind of embedded file.

An attachments panel showing a Relationship column.
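A Relationship column of this kind needs very little logic. The Python sketch below assumes file specification dictionaries have already been parsed into plain dicts keyed by entry name (UF, F, Desc, AFRelationship); the function names and the row layout are hypothetical.

```python
def attachment_row(filespec):
    """Turn one parsed file specification dictionary into a display row."""
    # Prefer the Unicode file name (UF) and fall back to F.
    name = filespec.get("UF") or filespec.get("F") or "(unnamed)"
    return {
        "Name": name,
        "Description": filespec.get("Desc", ""),
        # Left blank when absent, so the column distinguishes associated
        # files from other kinds of embedded files.
        "Relationship": filespec.get("AFRelationship", ""),
    }


def attachment_table(filespecs):
    """Build the rows for an attachments panel."""
    return [attachment_row(fs) for fs in filespecs]


rows = attachment_table([
    {"UF": "sales.csv", "Desc": "Figures behind the chart", "AFRelationship": "Data"},
    {"F": "notes.txt"},
])
for row in rows:
    print(row)
```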

For software that creates or edits PDF files, a new option to “Attach associated file” or “Associate an attached file” might enable more use of this feature, as it would tend to enhance support for open data and help to overcome perceptions that PDF obfuscates information. Adding a spreadsheet, CSV or JSON data to a plot or a table of numbers is a simple step end users could take by leveraging Associated Files with a “Data” relationship. If an attachment is an XML file, then adding the XML schema as an Associated File with “Schema” relationship is an obvious way to help users better utilize custom XML.

For more information about utilizing this feature, see the PDF Association’s publication: “PDF 2.0 Application Note 002: Associated Files”.

Enforced print scaling in Viewer Preferences

As an example of a single key added to an existing PDF dictionary, PDF 2.0 added an Enforce key to the Viewer Preferences dictionary [Table 147, ISO 32000-2:2020]. The new key specifies an array with PrintScaling being the only allowed value currently defined:

10 0 obj
<<
  /Direction /L2R
  /DisplayDocTitle true
  /NonFullScreenPageMode /UseOutlines
  /Duplex /DuplexFlipLongEdge
  /PrintScaling /None
  /Enforce [ /PrintScaling ]
>>
endobj

The purpose of an Enforce entry with a value of PrintScaling while the PrintScaling entry’s value is None is to prevent this setting from being overridden in the user interface of the print dialog. This ensures that printing will always be precisely 1:1 of the PDF, which is a very useful feature when it’s critical to ensure that content is not accidentally scaled up or down. Use cases include label printing, maps, CAD drawings, sewing templates…  anything for which an undisclosed scaling change can ruin the utility of the printout for its intended use.
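In a print dialog, this can reduce to a single check before enabling the scaling controls. The following Python sketch assumes the Viewer Preferences dictionary has been parsed into a plain dict with name values represented as strings; the function name print_scaling_locked is hypothetical.

```python
def print_scaling_locked(viewer_prefs):
    """Return True if the print dialog should not let the user change scaling.

    Scaling is treated as locked only when Enforce lists PrintScaling
    and the PrintScaling entry itself is present with the value None.
    """
    enforced = viewer_prefs.get("Enforce", [])
    return "PrintScaling" in enforced and viewer_prefs.get("PrintScaling") == "None"


prefs = {
    "Direction": "L2R",
    "DisplayDocTitle": True,
    "PrintScaling": "None",
    "Enforce": ["PrintScaling"],
}
print(print_scaling_locked(prefs))
```

When the function returns True, a viewer could disable (or hide) its "fit to page" and custom scale options and print at 100%.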

Embedded file thumbnails

Many PDF viewers include panes listing the embedded files (“attached files”) contained within a PDF file. Such panes are commonly presented as spreadsheet-like data grids, listing information such as the filename, file size and description derived from the File Specification and Embedded File stream dictionaries. As mentioned above, an easy addition is to include the AFRelationship key in this display so users can know that a spreadsheet is the “Source” data for a table, or “Alternative” data for a plot in the PDF, and so on. 

In addition to AFRelationship, PDF 2.0 also added an optional Thumb entry to all file specification dictionaries, defining a thumbnail of the embedded file (as an Image XObject). The addition of a graphical thumbnail view to embedded file panes might further help end users navigate PDF files with many embedded files, especially those with similar filenames.

Metadata streams

For documents…

Since PDF’s introduction, users have become very accustomed to interacting with PDF document metadata such as title, author, subject and keywords. Originally, this information was stored in PDF documents using a Document Information dictionary, but since PDF 1.4 and with the increasing adoption and use of PDF/A, PDF/X and other ISO PDF subsets, a shift to XML-based XMP metadata has begun. PDF 2.0 completes this shift for file-level metadata by (mostly) deprecating Document Information dictionaries in preference to the far more expressive XMP metadata stored in the Metadata entry of the Document Catalog.

However, not all viewing software has kept up. Many viewers today still fail to provide access to the document’s XMP metadata, while other software may require multiple clicks and present an unfriendly technical view of it. As PDF 2.0 adoption grows, along with adoption of PDF/A and PDF/X, PDFs with only XMP metadata may become more prevalent and the typical File | Properties… dialog will likely need to adapt.
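One way a properties dialog could adapt is to read the title from XMP first and fall back to the Document Information dictionary only when XMP provides none. The Python sketch below uses only the standard library and assumes the XMP packet bytes and the Info dictionary have already been extracted from the PDF; the function names are hypothetical, and only the dc:title property is handled.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xmp_title(xmp_packet):
    """Return the dc:title from an XMP packet, or None if unavailable."""
    try:
        root = ET.fromstring(xmp_packet)
    except ET.ParseError:
        return None
    items = root.findall(".//dc:title/rdf:Alt/rdf:li", NS)
    if not items:
        return None
    # Prefer the language-neutral default when several alternatives exist.
    for item in items:
        if item.get(XML_LANG) == "x-default":
            return item.text
    return items[0].text


def display_title(xmp_packet, info_dict):
    """Prefer XMP; fall back to the Document Information dictionary."""
    title = xmp_title(xmp_packet) if xmp_packet else None
    return title or (info_dict or {}).get("Title")
```

The same pattern would extend to author, subject and keywords, each of which maps to a different XMP property.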

For PDF objects…

For many years PDF has made it possible for any PDF object to include an associated metadata XMP data stream in the same way that any PDF object can now have an Associated File (see above). 

This capability could present challenges in how multiple metadata streams, if present in a document, might be meaningfully presented to end users. This might be especially important in redaction workflows, as cleaning PII from metadata is commonly required. Options might include new panes for presentation of all XMP metadata streams. Alternatively (or as a first step), the traditional File | Properties… dialogs that support document-level XMP metadata might be extended to include all XMP metadata streams.
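Before any of these presentation options can work, the software has to locate every metadata stream in the file. The Python sketch below walks a document that has been parsed into nested dicts and lists and reports the path of each object that carries a Metadata entry. The function name, the path notation and the in-memory model are all hypothetical; a real implementation would traverse indirect references through its own PDF library.

```python
def find_metadata_owners(obj, path="Root", seen=None):
    """Return paths of all dictionaries that have a Metadata entry."""
    if seen is None:
        seen = set()
    found = []
    if isinstance(obj, (dict, list)):
        # PDF object graphs contain cycles (e.g. a page's Parent entry),
        # so remember which containers were already visited.
        if id(obj) in seen:
            return found
        seen.add(id(obj))
    if isinstance(obj, dict):
        if "Metadata" in obj:
            found.append(path)
        for key, value in obj.items():
            if key != "Metadata":
                found.extend(find_metadata_owners(value, f"{path}/{key}", seen))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            found.extend(find_metadata_owners(value, f"{path}[{index}]", seen))
    return found


# Tiny stand-in for a parsed document: catalog-level metadata plus
# metadata attached to one image on the first page.
pages = {"Kids": []}
page = {"Parent": pages, "Resources": {"XObject": {"Im0": {"Metadata": b"<xmp/>"}}}}
pages["Kids"].append(page)
catalog = {"Metadata": b"<xmp/>", "Pages": pages}
print(find_metadata_owners(catalog))
```

A redaction tool could use such a list to make sure every metadata stream is reviewed, not only the document-level one.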

For PDF software that can create or edit PDFs, new options to “Attach metadata” or “Add metadata” are a simple way for end users to enrich their PDFs.

The PDF Association publishes PDF 2.0 Application Note 003: Use of object metadata streams that provides additional information.

(Many) other changes

A nice feature of the PDF 2.0 specification is that the Table of Contents is largely the same as PDF 1.7 (ISO 32000-1:2008), which allows developers to easily switch across to the latest version using the same clause numbers. Of course, new features have new subclauses and some subclauses were renamed to better reflect their content, but navigating ISO 32000-2:2020 (PDF 2.0) is generally the same as navigating ISO 32000-1:2008 (PDF 1.7).

Screenshot of ISO 32000 table of contents.

In addition to those mentioned above, PDF 2.0 introduced other new features which may require more development time and effort:

  • New cryptographic features including far stronger encryption, improved digital signatures and Unicode passwords. Many of these changes are introduced via Developer Extensions published by ISO as Technical Specifications, all of which are available at no cost from the PDF Association.
  • Unencrypted wrapper documents (ISO 32000-2, 7.6.7), as a means of providing a far better user experience for unrecognized and proprietary encryption algorithms such as those used by DRM solutions. This feature facilitates an outer unencrypted “envelope” PDF that can display instructions for obtaining plugins, etc. necessary to decrypt an encrypted “inner” PDF. If your technology supports DRM, then you skip directly to displaying the letter inside the envelope.
  • PRC (ISO 32000-2, 13.6) as a new 3D format, with STEP 3D model support also recently added via an ISO extension (ISO/TS 24064).
  • RichMedia (ISO 32000-2, 13.7) and 3D projection annotations, in preference to the legacy Movie and Sound annotations;
  • Black Point Compensation control at the object level. For more information see the PDF Association publication PDF 2.0 Application Note 001: Black Point Compensation.
  • Page-level Output Intents (see ISO 32000-2, 14.11.5) extend the existing document-level Output Intents capability to allow for specification of Output Intents at a page level. This greatly simplifies things when combining pages across multiple documents, each of which may have a different Output Intent. 
  • For implementations supporting graphic arts and commercial printing applications, spectral data, Halftone Origin, and the ability to specify the halftone dot shape as a list of spot function names (rather than restricting names from a predefined set);
  • Geospatial features (ISO 32000-2, 12.10);
  • The addition of vendor-neutral Navigators for PDF Portable Collections;
  • Many improvements to logical structure and Tagged PDF (ISO 32000-2, 14.7 and 14.8), including clarification for using tags defined in the standard structure namespace for PDF 1.7 alongside tags defined in the standard structure namespace for PDF 2.0 (ISO/TS 32005).

 
