Preparing for PDF files “from the future”

July 22, 2024

About Peter Wyatt

A developer and researcher working on PDF technologies for more than 20 years, Peter is the PDF Association’s CTO and an independent technology consultant.

BUSINESS NOTE

Certain new features in PDF (e.g., HDR images and modern compression options) will represent “breaking” changes from today’s PDF files. To ensure acceptance and adoption of these new features (and retain users’ trust) it’s vital that today’s PDF software provides a user-friendly experience when it encounters unsupported PDF versions or features.

It has been more than 20 years since PDF 1.5 introduced new filters that made files unprocessable by the software of the time. In the intervening years, PDF applications have fully implemented all filters, and support for a helpful user experience (UX) when encountering unknown features has, unfortunately, slowly eroded.

As the PDF Association's Imaging Model Technical Working Group considers additions to future versions of PDF to support better compression, HDR image formats, and other features, user experience must remain top-of-mind. Accordingly, the PDF Association has created a set of test files to help developers of today's PDF applications prepare to gracefully handle unexpected features.

One core feature of PDF is the set of supported filters (ISO 32000-2:2020, 7.4), including both general-purpose and image compression formats. Although PDF’s history includes several occasions in which new filters were added (FlateDecode in PDF 1.2, JBIG2Decode in PDF 1.4, and JPXDecode and Crypt filters with PDF 1.5), today, some PDF implementations have lost sight of the user experience when encountering such unknown features (or features of future PDF).

The core PDF specification (ISO 32000) does not prescribe user experience behavior, but as PDF technology moves forward, the user experience must be considered. This may include warning users when opening a PDF document with an unsupported PDF version, or providing proxy content on a page (such as a visual placeholder for missing content, where otherwise nothing would indicate a problem). Claiming that a file with an unsupported PDF version is “corrupt” is not in anyone’s interest, especially users! Appropriately preparing technology for "breaking" changes to PDF is an important early step in driving acceptance and adoption of new features.
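As an illustration of the distinction drawn above, the following minimal Python sketch separates "no PDF header at all" from "a header naming a version newer than the application supports". The function name, the `MAX_SUPPORTED` constant, and the message wording are all hypothetical. It also looks only at the file header, whereas a real application would need to consider the Version entry in the document catalog as well.

```python
import re
from typing import NamedTuple, Optional

# Highest version this hypothetical application supports (an assumption).
MAX_SUPPORTED = (2, 0)

# "%PDF-" followed by major.minor. The minor part may be non-numeric,
# because the test files described in this article use "%PDF-3.x".
HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\w+)")


class HeaderCheck(NamedTuple):
    status: str              # "ok", "future", or "no-header"
    message: Optional[str]   # text to show the user, if any


def check_header(first_bytes: bytes) -> HeaderCheck:
    """Classify the start of a file without calling a newer PDF 'corrupt'."""
    match = HEADER_RE.search(first_bytes[:1024])
    if match is None:
        return HeaderCheck("no-header", "This file does not appear to be a PDF.")
    major = int(match.group(1))
    minor_text = match.group(2).decode("ascii")
    minor = int(minor_text) if minor_text.isdigit() else None
    is_future = major > MAX_SUPPORTED[0] or (
        major == MAX_SUPPORTED[0]
        and (minor is None or minor > MAX_SUPPORTED[1])
    )
    if is_future:
        return HeaderCheck(
            "future",
            f"This file uses PDF {major}.{minor_text}, which is newer than "
            "this application supports. It may not display correctly.",
        )
    return HeaderCheck("ok", None)
```

With this approach, a file starting with `%PDF-3.x` yields a "future" status and an explanatory message rather than a corruption error.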

Which is better UX?

Dialog with text: "MyPDF PDF Editor This file appears to use a new format that this version of My PDF PDF Editor does not support. It may not open or display correctly and cannot be edited. MyPDF recommends that you upgrade to the latest version of our MyPDF PDF Editor product. Please visit our product site at http://xxx.mypdf.com/get-MyPDF/ Do not show this message again."

Dialog displaying text: "Corrupted PDF document The PDF document is corrupted and unfortunately cannot be read by PradaPDF. (PDF header not found)."

Screen-shot visualizing two approaches to an unsupported image filter; one page shows a warning, the other shows nothing.

PDF streams are used for many important features, ranging from file-level compression (via cross-reference streams and object streams) to optional features such as Linearization data, fundamental features such as page content streams, and rendering-specific data such as embedded fonts and images. In some of these cases, encountering an unknown stream filter can result in a complete failure to open and process the PDF (for example, when the filter is applied to cross-reference streams or object streams). In other situations, some form of fallback or alternate processing might be considered (e.g. font substitution or painting proxy placeholders for an image). Of course, the impact depends on the specific PDF application: not all PDF software renders pages, so some applications may still be able to process files with unknown filters on certain streams.
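The cases in this paragraph could be captured as a small lookup table that an application consults when a stream's filter is not recognized. The sketch below is one possible arrangement in Python. The role names, the `Impact` categories, and the suggested actions are illustrative assumptions, not part of any specification or product.

```python
from enum import Enum


class Impact(Enum):
    FATAL = "cannot open the file"
    DEGRADED = "can open, with missing or substituted content"
    IGNORABLE = "can open normally"


# Hypothetical role names; the grouping follows the paragraph above.
FALLBACKS = {
    "xref_stream": (Impact.FATAL, "explain that a newer application is needed"),
    "object_stream": (Impact.FATAL, "explain that a newer application is needed"),
    "linearization": (Impact.IGNORABLE, "read the file in the standard manner"),
    "page_content": (Impact.DEGRADED, "warn that page content cannot be shown"),
    "font": (Impact.DEGRADED, "substitute a font or draw proxy glyphs"),
    "image": (Impact.DEGRADED, "paint a placeholder where the image belongs"),
}


def plan_for(role: str) -> tuple[Impact, str]:
    """Return the impact and suggested action for an undecodable stream."""
    # Roles not listed above are treated as degraded (an arbitrary default).
    default = (Impact.DEGRADED, "warn that some content cannot be processed")
    return FALLBACKS.get(role, default)
```

A non-rendering application might use a different table, since an undecodable font or image stream may not affect it at all.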

Test files

This set of synthetic test files in the pdf-differences GitHub repo includes various types of streams individually encoded with a fake "new" filter called XXXDecode while also indicating a future fake PDF version (%PDF-3.x) in the PDF file header. In all other respects, these are valid PDF files consisting of a single PDF page containing an image and some text using an embedded subsetted font.
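To see which filter names a given file declares, a rough scan of the raw bytes can be enough for triage. The Python sketch below is a heuristic, not a PDF parser: it uses regular expressions to find `/Filter` entries and reports any name outside the standard filter list in ISO 32000-2. It does not decode `#xx` escapes in names, does not follow indirect references, and can report false positives from binary stream data. The function name is hypothetical.

```python
import re

# Standard filter names defined by ISO 32000-2.
STANDARD_FILTERS = {
    "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode",
    "RunLengthDecode", "CCITTFaxDecode", "JBIG2Decode", "DCTDecode",
    "JPXDecode", "Crypt",
}

# A /Filter key followed by either a single name or an array of names.
FILTER_ENTRY = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[^\s/\[\]<>()]+)")
NAME = re.compile(rb"/([^\s/\[\]<>()]+)")


def unknown_filters(data: bytes) -> set[str]:
    """Return filter names found in the raw bytes that are not standard."""
    found = set()
    for entry in FILTER_ENTRY.finditer(data):
        for name in NAME.findall(entry.group(1)):
            text = name.decode("latin-1")
            if text not in STANDARD_FILTERS:
                found.add(text)
    return found
```

Run against the bytes of one of the test files, a scan like this would be expected to report `XXXDecode`, which can then drive a user-facing message that names the actual problem.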

These files are intended for user experience and human-computer interaction (HCI) specialists to assess the behavior of their current generation of PDF applications to ensure that appropriate user-centric messages are displayed before future PDF changes are made and available to users.

Of course, unsupported filters may occur on any stream so this set of test PDF files must not be considered comprehensive; it’s merely a starting point for assessing today's PDF technology when opening PDFs “from the future”.

The following list describes some of the test files included in the pdf-differences GitHub repo and explains how each one uses the fake unknown filter:

UnknownFilter-xrefstm.pdf

Only the cross-reference stream is compressed using an unknown filter (the object stream uses FlateDecode). The PDF file is thus effectively unprocessable without additional recovery processing.

Beyond unsupported filters, consider the more general challenge of new PDF features that might encode cross-reference information in improved ways.

UnknownFilter-objstm.pdf

Only the object stream is compressed using an unknown filter (the cross-reference stream uses FlateDecode). The PDF file is thus effectively unprocessable as many objects are inaccessible, even if the cross-reference data is processable.

Beyond unsupported filters, consider the more general challenge of new PDF features that might encode objects in improved ways.

UnknownFilter-Linearized.pdf

The optional Linearization data is compressed using an unknown filter. The PDF should be fully processable by ignoring the optional Linearization data and processing in the standard manner.
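Processing in the standard manner begins at the end of the file, where the `startxref` keyword gives the byte offset of the cross-reference information. The short Python sketch below shows that first step only, assuming the whole file is already in memory. The function name and the `tail_size` default are arbitrary choices for illustration.

```python
import re
from typing import Optional


def find_startxref(data: bytes, tail_size: int = 2048) -> Optional[int]:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    # Use the last occurrence, which is the one closest to the end of file.
    return int(matches[-1]) if matches else None
```

An application that reaches this point without depending on the Linearization data has no reason to surface an error to the user for this test file.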

UnknownFilter-PageContentStream.pdf

The page content stream (of the only page in this PDF file) is compressed using an unknown filter. PDF applications may still be able to perform certain functions (e.g. count pages) as the size of the page (MediaBox, Rotate) is unaffected.

Beyond unsupported filters, consider the more general challenges of new PDF features that might represent content streams in improved ways, or introduce new operators.

UnknownFilter-Font.pdf

Only the embedded subsetted font used on the page is compressed using an unknown filter. Some applications may be able to perform font substitution or draw proxy glyphs using the available glyph width information. Other page content (the image) is fully processable. Non-rendering PDF applications may still be able to perform certain functions (e.g. count pages or extract images).
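Drawing proxy glyphs is possible because glyph widths live in the font dictionary rather than in the undecodable font program. As a rough Python sketch, assuming a simple font with `FirstChar` and `Widths` entries expressed in thousandths of a text space unit, the horizontal advance of each proxy box could be computed as follows. The function and parameter names are hypothetical, and character spacing, word spacing, and horizontal scaling are ignored.

```python
def proxy_advances(codes: bytes, first_char: int, widths: list[float],
                   font_size: float, missing_width: float = 0.0) -> list[float]:
    """Return the horizontal advance of each character code in text space."""
    advances = []
    for code in codes:
        index = code - first_char
        # Codes outside the Widths array fall back to the missing width.
        width = widths[index] if 0 <= index < len(widths) else missing_width
        advances.append(width * font_size / 1000)
    return advances
```

For example, two glyphs with widths 600 and 500 at a font size of 10 would advance by 6 and 5 units respectively, which is enough to draw correctly spaced placeholder boxes.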

UnknownFilter-ImageXObject.pdf

Only the embedded Image XObject used on the page is compressed using an unknown filter. Some applications may be able to paint a proxy indication of the missing image as the size and location of the image (i.e. the bounding box) is unaffected. PDF applications that do not render pages may still be able to perform certain functions (e.g. count pages or extract fonts or text).
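The bounding box is recoverable because an image is painted into the unit square of user space, which the current transformation matrix (CTM) then maps onto the page. The Python sketch below computes the axis-aligned box for a placeholder from a CTM given in PDF order `[a b c d e f]`. The function name is hypothetical, and a real renderer might prefer to draw the transformed quadrilateral itself rather than its bounding box.

```python
Matrix = tuple[float, float, float, float, float, float]


def image_bbox(ctm: Matrix) -> tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the unit square under the CTM."""
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    # PDF transforms a point as x' = a*x + c*y + e and y' = b*x + d*y + f.
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

For instance, a CTM of `(200, 0, 0, 100, 50, 700)` places a 200 by 100 unit image with its lower-left corner at (50, 700), so the placeholder would span from (50, 700) to (250, 800).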

Conclusion

As PDF Association stakeholders and experts are now actively working on advancing PDF with new features, backward compatibility is an important consideration. However, in adopting new imaging formats and updating compression algorithms, PDF files in the future can be expected to include some features that today’s PDF applications simply cannot process. For such files, it is important to consider the messages and dialogs presented to end users so as not to confuse or unduly worry them. Hopefully, this set of targeted test PDF files will encourage the industry to address this concern systematically… before such files appear in the marketplace!
