Don’t risk losing users’ trust: future-proof your PDF implementations

To avoid confusing and frustrating users, applications can and should be robustly prepared for significant changes to PDF – with end user messaging in mind – well before files start to appear in the marketplace.
About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise. A developer and researcher working on PDF technologies for more than …
Peter Wyatt
July 23, 2024


As the members of PDF’s technical community continue their work on defining a future version of PDF, with potentially large changes to core PDF (via the Imaging Model Technical Working Group), it is important to consider the impact of these changes on today’s software, and how implementers can best support end users. As an industry, we should be concerned about how we might confuse end users when they try to use future PDF files.

Historically, PDF creators ensured backward compatibility when writing out PDFs according to new PDF specifications. In the past, when significant “breaking” changes were made to PDF, Adobe controlled the specification and coordinated changes to PDF’s specification with releases of their dominant viewing and creation applications. With regular updates and “breaking changes” occurring every 3-4 years throughout the 1990s and 2000s, other PDF vendors at the time had to rush out their own software updates to regain compatibility. As a result, they were very aware of the importance of “future-proofing” their software.

PDF creation software is responsible for deciding whether (and which) additional “fallback” data is written (possibly at the cost of file size or complexity) and which specific PDF feature sets it uses. As is already the case today, some software packages will doubtless provide a user interface to let users control what PDF version or features are used. However, because the future PDF will include critical new features such as HDR, new raster image formats, better compression strategies and more, authors will sometimes need to use the latest and most capable PDF.

However, this does not diminish the responsibility that PDF-consuming software (viewers, applications, etc.) has to ensure a reliable experience. No assumptions can be made about how new features might or might not be implemented, how PDF versioning might work, etc., so recommendations at this point must be oriented towards what robust, well-written, forward-leaning, end-user-centric general PDF software can do to best support any possible future version of PDF. As software roll-outs can take a long time in larger organizations, it’s worth getting in front of this issue sooner rather than later.

A lot of relevant experience can be drawn from web browsers and how they manage similar incompatibility issues, including missing functionality. For example, “reporting to a user” as mentioned in this article does not always have to be via a primary user interface dialog or prominent banner but might be via logs (already commonplace in web browsers), a status bar symbol, or an on-screen widget. The key point is that an end-user must have some means of learning that an incompatibility has been detected between their software and the document they’ve opened and are (presumably) trying to understand!

A web browser console log, where detailed compatibility information can be found.

Responsible product managers should insist on a future-proofing strategy to ensure that applications continue to provide a reasonable end-user experience when (not if!) their software encounters a “future PDF”.

BUSINESS NOTE

To avoid confusing and frustrating users, applications can and should be robustly prepared for significant changes to PDF - with end user messaging in mind - well before files start to appear in the marketplace.

Some ISO-standardized subsets of PDF, most notably PDF/A and PDF/X, already contain specific content and rendering requirements. Accordingly, this information applies only to generic PDF applications, viewers, and web servers, not to software operating in PDF/A or PDF/X conformance modes.

PDF syntax

The fundamental COS syntax and basic PDF objects can probably be assumed to remain unchanged, as they haven’t changed since PDF was first introduced. Note that the introduction of UTF-8 in PDF 2.0 did not change the lexical or syntactic representation of literal or hexadecimal string objects - only their contextual interpretation. (Coincidentally, a lot of PDF software already supported UTF-8 because of its ubiquity when PDF 2.0 was first released in 2017!)

Furthermore, PDF is independent of hardware architecture as seen with the smooth transition from 16-bit to 32-bit to 64-bit and GPU-based computing. As computing power increases, PDF files and content are also growing in size, scope, and complexity, so all PDF implementations must be aware of overflow and underflow - as is critical nowadays anyway, due to cyber-attacks.

Pre-checks at runtime

Enabling end-users to understand discrepancies between the capabilities of their PDF software application and their PDF documents is fundamental in helping them reliably - and safely - consume and trust their documents - What You See Must Be What You’ve Got!

Check and report the PDF version

Traditionally, the primary means of indicating support and compatibility was the PDF’s version number, but as previously discussed, version numbers alone are limiting given the widely-used extensions to PDF 1.7 and PDF 2.0. Even more significantly, versions are also an unreliable technical indicator of the features present within a given PDF file. Accordingly, as is the case today, it is insufficient for implementations to rely solely on this mechanism at runtime.

“Breaking PDF” (that is, when a new definition of PDF changes previous definitions in a significant way that’s not backwards-compatible) will mean a new PDF version number, so correctly processing (per PDF’s specification) both the 1st line PDF file header (%PDF-x.y) and the Version entry in the Document Catalog dictionary will continue to provide useful information to end-users. As we recently reported, however, following many years of changes to the PDF specification that did not “break” PDF, proactively informing the user that the software has encountered unexpected content is no longer standard practice.
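
As a minimal sketch of the version check described above, the following Python reads the header version, combines it with the catalog's Version entry (which only applies when it is later than the header), and produces a short user-facing notice. The names `SUPPORTED_VERSION`, `header_version`, `catalog_version`, `effective_version`, and `version_notice` are hypothetical, the 1024-byte search window is a leniency assumption, and the catalog value is assumed to have been parsed already by the application's own PDF parser.

```python
import re

# Assumption: the newest version this hypothetical application fully supports.
SUPPORTED_VERSION = (2, 0)

_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")
_NAME = re.compile(r"(\d)\.(\d)")


def header_version(head: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    m = _HEADER.search(head[:1024])  # lenient search window (an assumption)
    return (int(m.group(1)), int(m.group(2))) if m else None


def catalog_version(name):
    """Return (major, minor) from a catalog Version name such as '1.7'."""
    m = _NAME.fullmatch(name) if isinstance(name, str) else None
    return (int(m.group(1)), int(m.group(2))) if m else None


def effective_version(header, catalog):
    """The catalog Version applies only when it is later than the header."""
    found = [v for v in (header, catalog) if v is not None]
    return max(found) if found else None


def version_notice(version, supported=SUPPORTED_VERSION):
    """Return a short message for the user, or None if nothing to report."""
    if version is None:
        return "The PDF version of this file could not be determined."
    if version > supported:
        return (f"This file declares PDF {version[0]}.{version[1]}, which is "
                "newer than this application supports; some content may be "
                "missing or displayed incorrectly.")
    return None
```

Where the notice is surfaced (log, status bar, banner) is left to the application; the point is only that the comparison result is not silently discarded.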

Check and report Developer Extensions

Developer extensions are part of how software learns of the features contained in current and future PDF files.

As previously described, supporting developer extensions (introduced with PDF 1.7) is fundamental to supporting PDF 2.0 now, and into the future. Many ISO-standardized extensions for PDF 2.0 are already published, in addition to vendor-specific extensions.
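
One way to report developer extensions is sketched below in Python. It assumes the catalog's Extensions dictionary has already been parsed into plain Python dictionaries, that each prefix maps to either a single developer extensions dictionary or a list of them, and that support is tracked as an exact set of (prefix, BaseVersion, ExtensionLevel) tuples. The function name `unsupported_extensions` and the sample values are hypothetical.

```python
def unsupported_extensions(extensions, supported):
    """List (prefix, BaseVersion, ExtensionLevel) tuples the app does not know.

    extensions: parsed Extensions dictionary from the Document Catalog.
    supported:  set of tuples this hypothetical application implements.
    """
    missing = []
    for prefix, value in extensions.items():
        if prefix == "Type":  # optional Type entry, not a developer prefix
            continue
        entries = value if isinstance(value, list) else [value]
        for ext in entries:
            key = (prefix, ext.get("BaseVersion"), ext.get("ExtensionLevel"))
            if key not in supported:
                missing.append(key)
    return missing


# Illustrative values only; real prefixes and levels come from the file.
doc_extensions = {
    "Type": "Extensions",
    "ADBE": {"BaseVersion": "1.7", "ExtensionLevel": 3},
    "XYZZ": [{"BaseVersion": "2.0", "ExtensionLevel": 1}],
}
known = {("ADBE", "1.7", 3)}
print(unsupported_extensions(doc_extensions, known))
```

Anything returned by this check is a candidate for the same kind of user-facing report as an unexpected version number.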

Check and report Document Requirements

PDF 1.7 introduced Document Requirements (via the Requirements entry in the Document Catalog dictionary) as “an array of requirement dictionaries that shall represent requirements for the document” (Table 29, ISO 32000-2:2020). As described in 12.11.2:

Such entries enable the document to identify feature(s) of PDF beyond those commonly expected, such as 2D graphics rendering, and are required for correct handling in accordance with this document. In addition, although not required for viewing, a document may also use requirement values that stipulate required features of interactive PDF processors such as the ability to interact with or modify the document.

The Document Requirements feature therefore provides a mechanism that authors of PDF files can use to identify new features that are critically necessary for understanding a particular document. Fully implementing Document Requirements ahead of future “breaking changes” to PDF includes calculating penalty values, processing requirement versions, and communicating with end users to provide a robust and backward-compatible means of supporting future PDF documents for which the author has defined certain requirements.
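
The sketch below illustrates the general shape of a requirements check in Python. It is not a rendering of the normative rules: the aggregation policy (summing the penalties of unmet requirements and capping at 100) and the default penalty of 100 are assumptions made for illustration and should be checked against ISO 32000-2, 12.11. Requirement versions and requirement handlers are not modelled, and `assess_requirements` and the `FutureFeature` requirement name are hypothetical.

```python
def assess_requirements(requirements, supported):
    """Return (total_penalty, unmet_names) for a parsed Requirements array.

    requirements: list of requirement dictionaries (S names the requirement).
    supported:    set of requirement names this application satisfies.
    """
    unmet = [r for r in requirements if r.get("S") not in supported]
    # Assumption: a missing Penalty counts as 100, and penalties are summed
    # and capped at 100. Verify against the specification before relying on it.
    total = min(100, sum(r.get("Penalty", 100) for r in unmet))
    return total, [r.get("S") for r in unmet]


reqs = [
    {"S": "EnableJavaScripts", "Penalty": 20},
    {"S": "Navigation"},
    {"S": "FutureFeature", "Penalty": 30},  # hypothetical future requirement
]
print(assess_requirements(reqs, {"Navigation"}))
```

An application might use the total to choose between a quiet log entry and a more prominent warning, which keeps the messaging proportionate to what the author declared as critical.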

Features supporting the future

Many current PDF features are specified in ways that support future extensions, with many already extended in previous versions of PDF. This includes numerous dictionary keys whose possible values were extended, where the meaning of bit positions in bitmask flags changed, or where a range of integers was altered.

Indicating that a future PDF file is “corrupt”, or failing to indicate missing content, can confuse end users or reduce users’ trust in the format. Malicious actors already exploit differences in the appearance of pages and in the behavior of implementations, so providing clear information to end users can help to mitigate these risks. Although the PDF specification identifies that future PDF versions might extend many features, there has been a significant gap (~20 years!) since PDF’s last major “breaking” update. Accordingly, PDF application developers should review their implementations to ensure that prior assumptions still hold true today.

Many historical changes in PDF by version are identified in the Arlington PDF Model:

Addition of new keys to existing dictionaries: this is the most common way that PDF has been extended and is easily recognized by differences in the 3rd field called “SinceVersion” of the TSV model data files.

Addition of new types to existing keys: search for version-based predicates in the 2nd field called “Type”. For example:

  • "fn:SinceVersion(1.1,array);name"
    means an array type was added to the existing name type from PDF 1.1. This was when the Indexed color space was extended to support the many new color spaces that were added with PDF 1.1;
  • "array;fn:SinceVersion(1.3,dictionary);name;stream"
    means a dictionary type was added to the existing array, name, and stream types from PDF 1.2 with PDF 1.3. This was when PDF 1.3 introduced Type 2 and Type 3 functions and the existing TransferFunctions in the graphic state parameter and halftone objects were extended;

Addition of new permitted values: search for version-based predicates in the 9th field called “PossibleValues”. For example:

  • Integer values: when JPEG 2000 was introduced with PDF 1.5, the BitsPerComponent entry in Image XObjects was extended to include the value 16:
    [1,2,4,8,fn:SinceVersion(1.5,16)]
  • Bitmask flags: the 32-bit bitmask flags used in various PDF objects were extended over time, such as for the annotation F entry:
    [fn:Eval(fn:BeforeVersion(1.7,fn:BitsClear(10,32)) && fn:SinceVersion(1.7,fn:BitsClear(11,32)))]
  • Names: A common method of extending PDF, new names were added to the list of supported stream Filters, DigestMethods, structure attribute owners (O entry), etc.:
    [ASCIIHexDecode,ASCII85Decode,LZWDecode,fn:SinceVersion(1.2,FlateDecode),RunLengthDecode,CCITTFaxDecode,fn:SinceVersion(1.4,JBIG2Decode),DCTDecode,fn:SinceVersion(1.5,JPXDecode),fn:SinceVersion(1.5,Crypt)]
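
To show how such version-based predicates can be read mechanically, here is a small Python sketch that expands a “PossibleValues” list for a given PDF version. It handles only flat `fn:SinceVersion(x.y,value)` predicates like the integer and name examples above; nested expressions such as the `fn:Eval(...)` bitmask example are deliberately out of scope. The function name `allowed_values` is hypothetical and is not part of the Arlington PDF Model tooling.

```python
import re

# Matches either a flat fn:SinceVersion(x.y,value) predicate or a plain value.
_TOKEN = re.compile(r"fn:SinceVersion\((\d+\.\d+),([^()]+)\)|([^,\[\]]+)")


def _ver(text):
    """Turn '1.5' into (1, 5) so versions compare numerically."""
    major, minor = text.split(".")
    return (int(major), int(minor))


def allowed_values(possible_values, pdf_version):
    """Return the values permitted for pdf_version, as strings."""
    result = []
    for match in _TOKEN.finditer(possible_values):
        since, gated, plain = match.groups()
        if plain is not None:
            result.append(plain)          # value allowed in every version
        elif _ver(pdf_version) >= _ver(since):
            result.append(gated)          # value allowed from 'since' onward
    return result


bpc = "[1,2,4,8,fn:SinceVersion(1.5,16)]"
print(allowed_values(bpc, "1.4"))
print(allowed_values(bpc, "1.5"))
```

The same pattern, run with a version newer than any the software knows, is a cheap way to discover which values an implementation has never been tested against.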

As the PDF Association’s Technical Working Groups are planning more fundamental changes to PDF, it becomes more important for PDF software to actively support such flexibility to ensure a reasonable end-user experience when encountering PDF files written against new and future versions and extensions.

Unencrypted File Wrappers

PDF 2.0 introduced unencrypted wrapper documents (ISO 32000-2, 7.6.7) as a general mechanism enabling applications to provide users with a meaningful experience even when encountering proprietary or otherwise unknown future encryption algorithms. This feature operates by “wrapping” the encrypted payload PDF within an unencrypted PDF document that any PDF software can read, enabling the unencrypted content to present instructions or other author-controlled content to end users, regardless of whether their software supports the payload’s advanced encryption.

Naturally, it is for the PDF creation software to decide if an encrypted PDF needs to be wrapped, but the increasing adoption of modern encryption makes it important for PDF software to include unencrypted wrapper detection to future-proof handling of proprietary and future encryption algorithms.
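
A rough Python sketch of wrapper detection follows. It assumes the wrapper's embedded file specification dictionaries have already been parsed, and that an encrypted payload is marked by an AFRelationship value of EncryptedPayload and described by an encrypted payload dictionary under the EP key (with Subtype naming the cryptographic filter). Those key names reflect a reading of ISO 32000-2, 7.6.7 and should be verified there; `find_encrypted_payloads` is a hypothetical helper.

```python
def find_encrypted_payloads(filespecs):
    """Return a summary for each file specification that wraps a payload.

    filespecs: list of parsed file specification dictionaries.
    """
    hits = []
    for fs in filespecs:
        ep = fs.get("EP")
        marked = fs.get("AFRelationship") == "EncryptedPayload"
        if marked or isinstance(ep, dict):
            ep = ep if isinstance(ep, dict) else {}
            hits.append({
                "file": fs.get("UF") or fs.get("F"),
                "filter": ep.get("Subtype"),   # name of the crypto filter
                "version": ep.get("Version"),
            })
    return hits


specs = [
    {"F": "notes.txt"},
    {"UF": "payload.pdf", "AFRelationship": "EncryptedPayload",
     "EP": {"Type": "EncryptedPayload", "Subtype": "AcmeCrypt", "Version": "1"}},
]
print(find_encrypted_payloads(specs))
```

If a payload is found and its filter is unknown, the application can keep showing the wrapper's own pages (which carry the author's instructions) while reporting that the payload could not be opened.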

Unsupported annotation types

When PDF was introduced in 1993, PDF 1.0 defined 2 types of annotations. More were added over successive editions, such that PDF 2.0 defines 28 annotation types (3 of which are now deprecated). The core PDF specifications have always explicitly stated that annotations are extensible, with the latest PDF specification clarifying certain aspects (in ISO 32000-2, 12.5.1). This very clearly states that PDF software can - and should - expect new annotation types and provide certain basic behaviors even for unknown annotations:

A PDF processor shall provide annotation handlers for all of the conforming annotation types. The set of annotation types is extensible. An interactive PDF processor shall provide certain expected behaviour for all annotation types that it does not recognise, as documented in 12.5.5, "Appearance streams" and "Table 167 - Annotation flags" (bit positions 1 and 2).

Annotations have many roles in PDF. In addition to representing end-user visible content (such as markup and review comments), annotations also enable links, forms, digital signatures, multimedia, and more, so unsupported annotations may have a large impact on the end-user experience. PDF’s Document Requirements feature (see above) was introduced, in part, to allow authors to communicate such critical impact for specific documents.

End-users should be informed that unsupported annotation types were encountered (as defined by the required Subtype entry in all annotation dictionaries). How this is achieved and precisely what information the end user is told is not specified, but for a good UX it’s important to avoid flooding end users with numerous modal dialogs or overwhelming them with technical information.
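
The following Python sketch shows one possible way to combine the required fallback behavior for unrecognized annotations with a single, non-modal summary for the user. Annotation dictionaries are assumed to be parsed already; the `SUPPORTED_SUBTYPES` set, the function names, and the `Hologram` subtype are hypothetical.

```python
from collections import Counter

INVISIBLE = 1 << 0  # annotation flag, bit position 1
HIDDEN = 1 << 1     # annotation flag, bit position 2

# Assumption: subtypes this hypothetical viewer has handlers for.
SUPPORTED_SUBTYPES = {"Text", "Link", "Widget", "Highlight"}


def plan_annotation(annot, supported=SUPPORTED_SUBTYPES):
    """Decide how to present one annotation: 'handler', 'appearance', 'skip'."""
    flags = annot.get("F", 0)
    if flags & HIDDEN:
        return "skip"                      # hidden regardless of type
    if annot.get("Subtype") in supported:
        return "handler"
    if flags & INVISIBLE or "AP" not in annot:
        return "skip"                      # unknown type, nothing to draw
    return "appearance"                    # unknown type, draw its AP stream


def unsupported_summary(annots, supported=SUPPORTED_SUBTYPES):
    """Return one line summarizing unknown subtypes, or None."""
    counts = Counter(str(a.get("Subtype")) for a in annots
                     if a.get("Subtype") not in supported)
    if not counts:
        return None
    parts = ", ".join(f"{name} ({n})" for name, n in sorted(counts.items()))
    return f"Unsupported annotation types on this page: {parts}"


page_annots = [
    {"Subtype": "Hologram", "AP": {}},
    {"Subtype": "Hologram", "F": INVISIBLE, "AP": {}},
    {"Subtype": "Link"},
]
print([plan_annotation(a) for a in page_annots])
print(unsupported_summary(page_annots))
```

Aggregating by subtype keeps the report to one line per page or document, rather than one dialog per annotation.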

Unrecognized or unsupported named resources

The Resources dictionary of content streams currently defines 7 different types of sub-dictionaries that match so-called “named resources” to specific objects, including ColorSpace for color spaces, XObject for Form and Image XObjects, Font for fonts, Pattern for patterns, etc. Future PDF versions may extend this idiom to include additional types of new named resources, or extensions to the current set of resource types (such as extra entries in color space array objects).

Separating rendering from extraction/reuse and other functionality

Logical structure and Tagged PDF define flexible mechanisms to represent the semantics of document content in PDF files. Even if future PDF content streams contain new operators, additional operands, images using unknown filters, unknown named resources, etc., today’s semantic data structures underlying these features - such as MCIDs, Alt (like AltText in HTML), Lang, and the logical structure hierarchy that is rolemapped back to a known set of tags - should provide sufficient information for continued meaningful extraction and reuse.

Problems with rendering (the visual appearance) may not necessarily prohibit content extraction from PDFs written to a future PDF version, so implementers may wish to separate these concerns. Other applications such as metadata extraction, document or page metrics, etc. should also function even if a rendering cannot be fully processed.

File structure

Any alteration to PDF file layout or structure in future versions of PDF has the greatest potential to impact end users with unopenable and/or non-interoperable PDFs, or to crash today's PDF applications (a form of DoS attack!). Based on prior experience when cross-reference streams, object streams, and so-called “hybrid-reference” PDFs were introduced with PDF 1.5, any level of structural change to the PDF file format must be carefully considered by implementers.

Cross-reference stream data

PDF 1.5 introduced cross-reference streams to reduce overall PDF file size via stream compression filters. This feature utilizes binary data in “fields”; its definition includes a forward-looking statement about new field types:

Each entry in a cross-reference stream shall have one or more fields, the first of which designates the entry’s type (see "Table 18 — Entries in a cross-reference stream"). In PDF 1.5 through PDF 2.0, only types 0, 1, and 2 are allowed. Any other value shall be interpreted as a reference to the null object, thus permitting new entry types to be defined in the future. (7.5.8.3 Cross-reference stream data)

In other words, the current PDF specification requires implementations to treat any entry type other than the currently permitted values 0, 1, or 2 as a reference to the null object. Handling this correctly is what reliably ensures backward compatibility.
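
A minimal Python sketch of this rule is shown below. It decodes already-decompressed cross-reference stream data using the W array of field widths and maps any unknown entry type to a null-object entry instead of raising an error. The function name `parse_xref_entries` and the tuple layout of the results are hypothetical, and the Index array and error handling for truncated data are omitted.

```python
def parse_xref_entries(data, w):
    """Decode cross-reference stream entries.

    data: decoded stream bytes; w: the W array (three field widths in bytes).
    Unknown entry types become ('null', type) rather than errors.
    """
    size = sum(w)
    if len(w) != 3 or size == 0:
        raise ValueError("W must hold three widths with a non-zero total")
    entries = []
    for pos in range(0, len(data) - size + 1, size):
        fields, off = [], pos
        for width in w:
            chunk = data[off:off + width]
            fields.append(int.from_bytes(chunk, "big") if width else None)
            off += width
        kind = 1 if fields[0] is None else fields[0]  # type defaults to 1
        f2 = fields[1] or 0
        f3 = fields[2] or 0
        if kind == 0:
            entries.append(("free", f2, f3))
        elif kind == 1:
            entries.append(("in_use", f2, f3))        # offset, generation
        elif kind == 2:
            entries.append(("compressed", f2, f3))    # object stream, index
        else:
            entries.append(("null", kind))            # future entry type
    return entries


sample = bytes([1, 0, 16, 0,   2, 0, 5, 3,   7, 0, 0, 0])
print(parse_xref_entries(sample, (1, 2, 1)))
```

Keeping the unknown type value in the result makes it easy to log that a future entry type was encountered, while the object itself still resolves to null.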

Content Streams

Correctly coping with unexpected input when parsing content streams (i.e. the graphics operators and their operands) is important as page rendering may be affected. In some cases (and where otherwise not prohibited by the PDF specification), providing a log, warnings, or proxy content to users that their page display may not be correct might be considered helpful… or vital, depending on the case.

Ensuring use of compatibility operators

Any new operators and their operands will likely be specified to lie within the existing BX/EX compatibility operator pair, as that is what the BX/EX compatibility operators (ISO 32000-2, Table 33) are intended for:

Table 33 — Compatibility operators
Operands | Operator | Description
-        | BX       | (PDF 1.1) Begin a compatibility section. Unrecognised operators (along with their operands) shall be ignored without error until the balancing EX operator is encountered.
-        | EX       | (PDF 1.1) End a compatibility section begun by a balancing BX operator. Ignore any unrecognised operands and operators from previous matching BX onward.

% FAKE: a new operator “newOp” within BX/EX doesn’t generate errors
BX
0.1 /Two [ 3 4 5 ] newOp
EX

Implementations need to ensure that error messages are suppressed for any unexpected new operators (along with all their operands) that occur between BX and EX. As PDF specifically requires (ISO 32000-2, 7.8.2) that all operands are removed after each operator (including unknown operators), operands cannot alter the behavior of other operators:

Operators do not return results, and operands shall not be left over when an operator finishes execution.

Unlike PostScript, PDF does not have a persistent operand stack.
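
The behavior described above can be sketched in Python as a small interpreter loop. It assumes the content stream has already been tokenized, with operators wrapped in a hypothetical `Op` type and everything else treated as an operand; `interpret` and the `handlers` mapping are likewise illustrative names. Unknown operators are ignored silently inside a BX/EX section, reported outside one, and the operand list is cleared after every operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Marks a token as an operator keyword (hypothetical tokenizer output)."""
    name: str


def interpret(tokens, handlers):
    """Run handlers for known operators; return warnings for unknown ones."""
    operands, depth, warnings = [], 0, []
    for tok in tokens:
        if not isinstance(tok, Op):
            operands.append(tok)
            continue
        if tok.name == "BX":
            depth += 1                      # enter a compatibility section
        elif tok.name == "EX":
            depth = max(0, depth - 1)       # tolerate an unbalanced EX
        elif tok.name in handlers:
            handlers[tok.name](list(operands))
        elif depth == 0:
            warnings.append(f"unknown operator '{tok.name}' outside BX/EX")
        # Operands never survive past the operator that follows them.
        operands.clear()
    return warnings


widths = []
stream = [Op("BX"), 0.1, "/Two", [3, 4, 5], Op("newOp"), Op("EX"),
          2, Op("w"), 1, Op("other")]
print(interpret(stream, {"w": widths.append}))
print(widths)
```

Returning warnings rather than raising lets the caller decide whether they go to a log, a status indicator, or nowhere at all.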

The only other time in the history of PDF when PDF creators used the BX/EX operators was when the sh operator was initially introduced with PDF 1.3 back in 2000. Over time and with the adoption of support for the shading operator, correct support for the compatibility operators has eroded.

Additional operands

Like PostScript, PDF places all operands immediately before the operator:

In PDF, all of the operands needed by an operator shall immediately precede that operator. Operators do not return results, and operands shall not be left over when an operator finishes execution. (7.8.2 Content Streams)

This means that if additional operands are ever added to existing operators, they would be added at the start of the line furthest from the operator (at the bottom of the operand stack) to ensure backward compatibility. Thus implementations should ensure that operands are always correctly indexed relative to the operator keyword and the top of the stack:

% Current ri operator has a single name operand
/RelativeColorimetric ri

% FAKE: a future PDF version with 2 additional operands
% This is backward compatible.
/SuperDuperRI 1.23 /RelativeColorimetric ri

The PDF specification requires the implementation to remove all operands after processing each operator, ensuring that any additional operands that might be added in future versions of PDF cannot influence subsequent operators in the content stream.
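
Indexing operands relative to the operator can be sketched in Python as follows. The helper `take_operands` is hypothetical: it takes the operands collected before an operator and the number the operator is known to need, and returns the operands nearest the operator along with any extras that preceded them.

```python
def take_operands(operands, arity):
    """Split operands into (used, extra), counting back from the operator.

    operands: everything collected since the previous operator, in order.
    arity:    how many operands this implementation expects for the operator.
    """
    if len(operands) < arity:
        raise ValueError("too few operands for operator")
    split = len(operands) - arity
    return operands[split:], operands[:split]


# Today's form of ri, and the FAKE future form from the example above.
print(take_operands(["/RelativeColorimetric"], 1))
print(take_operands(["/SuperDuperRI", 1.23, "/RelativeColorimetric"], 1))
```

The `extra` list is what an implementation might log or count, so that users can be told their software saw operands it did not understand.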

Images

Whether they are inline images or Image XObjects, raster images in PDF files rely on stream filters to encode pixels. Future versions of PDF may introduce new image codecs or formats that today’s implementations will not understand or support. Similar previous changes to PDF included the introduction of JBIG2 in PDF 1.4 and JPEG 2000 in PDF 1.5.

As discussed in Preparing for PDF files “from the future”, PDF software generally knows when it is painting or processing images due to context. As is common in browsers, inserting proxy placeholders for images with unsupported filters is one way to ensure end-users are aware that their document isn’t fully supported.
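
As a sketch of the placeholder decision, the Python below checks an image stream dictionary's Filter entry (a single name or an array of names) against the filters an application supports. The `SUPPORTED_FILTERS` set, the function names, and the `FutureImageDecode` filter are hypothetical; drawing the placeholder itself is left to the renderer.

```python
# Assumption: filters this hypothetical application can decode.
SUPPORTED_FILTERS = {
    "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode",
    "RunLengthDecode", "CCITTFaxDecode", "JBIG2Decode", "DCTDecode",
    "JPXDecode",
}


def unsupported_filters(stream_dict, supported=SUPPORTED_FILTERS):
    """Return the filter names in stream_dict that are not supported."""
    value = stream_dict.get("Filter")
    if value is None:
        names = []
    elif isinstance(value, list):
        names = value
    else:
        names = [value]
    return [name for name in names if name not in supported]


def image_action(stream_dict):
    """Return ('decode', None) or ('placeholder', message) for an image."""
    missing = unsupported_filters(stream_dict)
    if missing:
        return ("placeholder",
                "Image uses unsupported filter(s): " + ", ".join(missing))
    return ("decode", None)


print(image_action({"Filter": "DCTDecode"}))
print(image_action({"Filter": ["FlateDecode", "FutureImageDecode"]}))
```

Because the check runs before any decoding is attempted, an unknown filter leads to a visible proxy and a message rather than a decoding error or a silently blank area.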

Examples of browser proxies for unsupported content.

Unsupported general compression filters

New general-purpose compression filters may be added in a future PDF specification that can compress all kinds of stream data, such as content streams, cross-reference streams, object streams, image and form XObjects, certain shading objects, XMP metadata, ICC profiles, embedded files, or other assets needed during processing. This type of change has only happened once before in the history of PDF, with the introduction of Flate compression back with PDF 1.2 (1996).

It is impossible to predict how an unsupported general-purpose compression filter might impact page appearance, the PDF navigation experience, interactivity, or other qualities. As described in Preparing for PDF files “from the future”, ensuring that users are informed about incompatibility is critical to a good user experience. PDF creation software can also provide options so that authors can make informed decisions about how to handle compatibility problems.

Unsupported stream data formats

Once decompressed with one or more known filters, stream data may not be recognized by today's PDF parsers. In many cases, the context of a given stream is readily understood by today's software from the way it is referenced, such as whether the unsupported data is a content stream, color space-related data, font-related data, etc.

If the unsupported data is related to page content in such a way that nothing can be painted, (e.g., content streams, XObjects, shadings, patterns, color data, annotation appearances, etc.), then the bounding box of the unsupported content area might be replaced with a proxy, in a similar manner to unsupported image formats. This provides some visual indication to users that the software was unable to fully handle something in their PDF document.

Unsupported keys in existing dictionaries

Dictionaries can occur in the body of a PDF file as direct and indirect objects, in object streams, and in content streams as operands.

Unsupported keys in dictionaries are common today, with private data and second-class and third-class names. Today, most PDF software supports only a necessary subset of keys, and will simply ignore other entries. However, to preserve features used by other software, PDF software that modifies or transforms existing PDF files to new PDF files might need to ensure that all unknown first-class and second-class name keys and their values are faithfully preserved in the output PDF. Of course, PDF redaction software (and other classes of specialized software) may prefer not to do this.

Note that ISO 32000-2, 7.5.6 already requires this behavior for the trailer dictionary when writing incremental updates:

The added trailer shall contain all the entries except the Prev entry (if present) from the previous trailer, whether modified or not.
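
The trailer rule quoted above suggests a general pattern for preserving unknown keys: start from a copy of the existing dictionary and apply changes to it, rather than rebuilding the dictionary from only the keys the software knows. A minimal Python sketch, with the hypothetical helper `next_trailer` and an invented `XFuture` key standing in for an entry from a future PDF version:

```python
def next_trailer(previous, changes, prev_xref_offset):
    """Build the trailer for an incremental update.

    previous:         parsed trailer dictionary of the file being updated.
    changes:          entries this update modifies (for example Size).
    prev_xref_offset: byte offset of the previous cross-reference section.
    """
    # Carry over every entry except Prev, including keys we do not recognize.
    trailer = {key: value for key, value in previous.items() if key != "Prev"}
    trailer.update(changes)
    trailer["Prev"] = prev_xref_offset
    return trailer


old = {"Size": 10, "Root": "1 0 R", "Prev": 100, "XFuture": "keep me"}
print(next_trailer(old, {"Size": 12}, 4321))
```

The same copy-then-update approach applies to any dictionary a tool rewrites, except where (as with redaction) dropping unknown data is the intent.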

Unsupported object types for existing keys

Another common extension to past versions of PDF was the inclusion of additional object types for existing keys in dictionaries. As the new type is generally unexpected, existing implementations should already safely ignore such changes. However, there is some ambiguity with existing requirements; for example, some keys are defined as being “present”, but it’s not clear whether this requirement applies to only a key name or to the key/value pair according to the defined types. The PDF Association is clarifying these situations as they are brought to the respective working group’s attention. If you have a question along these lines, feel free to let us know.

Extra elements in existing arrays

Like dictionaries, array objects can also occur in the body of a PDF file as direct and indirect objects, in object streams, and in content streams as operands. In most cases, the PDF specification does not define what should happen if an array of a certain length has more array elements than the current PDF specification implies.

Beyond the cases where the PDF specification defines relationships that address array length data integrity, it’s feasible that a future version of PDF might append additional entries to existing arrays. In such cases, extra array elements at the end of an array will likely get ignored by today’s software. As with unsupported keys in dictionaries, PDF software that modifies or transforms PDF files should consider ensuring that additional array values are faithfully preserved in the output PDF. Also as with dictionaries, PDF redaction and other specialized software may choose not to do this.

Conclusion

There are thousands of PDF software applications that will need updates to robustly support new “breaking” PDF features that will be introduced in the future. Given that many PDF libraries and applications were developed since the last significant breaking change - in PDF 1.6 way back in 2004 - it is critically important that the PDF ecosystem proactively considers the processing of future incompatible PDF files now before their users encounter them in the wild.

PDF has always been and always will be fully extensible; PDF software should always be written to reflect that fact.
