PDF versions

March 25, 2024

How meaningful is PDF’s version number? SafeDocs research reveals that more than 20% of sampled real-world files contain features newer than their stated version!

About Peter Wyatt

A developer and researcher working on PDF technologies for more than 20 years, Peter is the PDF Association’s CTO and an independent technology consultant.

BUSINESS NOTE

In a significant proportion of files, PDF version information does not reflect the features used in a given PDF file, and is therefore misleading to end-users. When PDF version data is combined with PDF Extensions and metadata, a more accurate understanding can be achieved, and interoperability improved.

A PDF file identifies its version using the file header (%PDF-x.y) or, if present, the Version entry in the Document Catalog dictionary. Typically, the version in the file header represents the version of the original PDF when it was created. The Version entry allows software to update the version when adding an incremental update, which may introduce features from newer versions of PDF. As a result, a PDF file may accrete multiple Document Catalog dictionaries over multiple incremental updates, with each update changing the PDF version as the updating software sees fit.

As a PDF file is typically processed from the end of the file, the most recent incremental update, and any Document Catalog it contains, is encountered first, with the latest Version entry prompting software to alter its behavior if necessary. Of course, the version number may be greater than is required by the features in the file. So, for example, a PDF 1.6 file may contain only features defined in PDF 1.4 or earlier, but a PDF 1.4 file should not contain features defined in PDF 1.5, PDF 1.6 or later.
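The rule described above can be sketched in a few lines of C++. This is a minimal illustration, not code from any particular PDF library: the names PdfVersion, parseVersion, headerVersion and effectiveVersion are hypothetical, and the parser assumes the single-digit "x.y" form used by every PDF version to date.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// (major, minor); std::pair compares lexicographically, so 2.0 > 1.7.
using PdfVersion = std::pair<int, int>;

// Parses exactly "x.y" with single decimal digits (an assumption of this sketch).
std::optional<PdfVersion> parseVersion(std::string_view text) {
    if (text.size() != 3 || text[1] != '.') return std::nullopt;
    if (text[0] < '0' || text[0] > '9' || text[2] < '0' || text[2] > '9') return std::nullopt;
    return PdfVersion{text[0] - '0', text[2] - '0'};
}

// Reads the version from the start of a file, e.g. "%PDF-1.4".
std::optional<PdfVersion> headerVersion(std::string_view fileStart) {
    constexpr std::string_view marker = "%PDF-";
    if (fileStart.substr(0, marker.size()) != marker) return std::nullopt;
    return parseVersion(fileStart.substr(marker.size(), 3));
}

// The effective version is the later of the header version and the
// Version entry of the most recent Document Catalog, if there is one.
PdfVersion effectiveVersion(PdfVersion header, std::optional<PdfVersion> catalog) {
    const PdfVersion result = (catalog && *catalog > header) ? *catalog : header;
    assert(result >= header);  // an update can raise the version, never lower it
    return result;
}
```

For a file whose header reads %PDF-1.4 and whose latest Document Catalog has a Version entry of 1.7, effectiveVersion yields 1.7; a catalog entry of 1.2 would be ignored and the header's 1.4 retained.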

One point of potential confusion is that a PDF file’s XMP metadata may also identify a PDF version within the Adobe PDF namespace, e.g.:

<pdf:PDFVersion>1.7</pdf:PDFVersion>

However, the PDFVersion field in XMP is purely descriptive metadata, and is not defined to be normative in any edition of ISO 32000, the core PDF specification, or any ISO PDF subset.

Version ground truth

The Arlington PDF Model, in the third “SinceVersion” column of each TSV, provides the ground truth version of the PDF specification in which that key or array element was originally introduced. In some cases, later versions of the PDF specification have extended the set of permitted values for keys; this is expressed using the fn:SinceVersion(...) predicate. For example, the WritingMode entry of a Structure Attribute dictionary (StructureAttributesDict.tsv) was first introduced in PDF 1.4 but new values such as TbLr were introduced in PDF 2.0:

[LrTb,RlTb,TbRl,fn:SinceVersion(2.0,TbLr),fn:SinceVersion(2.0,LrBt),fn:SinceVersion(2.0,RlBt),fn:SinceVersion(2.0,BtRl),fn:SinceVersion(2.0,BtLr)]
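To illustrate how such an expression can be turned into per-value version data, here is a rough C++ sketch. It is not the Arlington TestGrammar implementation: the names toTenths, PermittedValue, parsePossibleValues and requiredVersion are hypothetical, only the simple fn:SinceVersion(version,value) form shown above is handled, and versions are stored as major*10 + minor on the assumption of single-digit "x.y" numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// 1.4 -> 14, 2.0 -> 20 (assumes the single-digit "x.y" form).
int toTenths(std::string_view version) {
    assert(version.size() == 3 && version[1] == '.');
    return (version[0] - '0') * 10 + (version[2] - '0');
}

struct PermittedValue {
    std::string name;          // e.g. "TbLr"
    std::optional<int> since;  // set only when wrapped in fn:SinceVersion(...)
};

// Parses "[A,B,fn:SinceVersion(2.0,C)]", splitting on top-level commas only.
std::vector<PermittedValue> parsePossibleValues(std::string_view list) {
    if (!list.empty() && list.front() == '[') list.remove_prefix(1);
    if (!list.empty() && list.back() == ']') list.remove_suffix(1);

    constexpr std::string_view prefix = "fn:SinceVersion(";
    std::vector<PermittedValue> values;
    std::string item;
    int depth = 0;

    // Converts the accumulated item into a PermittedValue and resets it.
    auto flush = [&]() {
        std::string_view view = item;
        if (view.substr(0, prefix.size()) == prefix && view.back() == ')') {
            view.remove_prefix(prefix.size());
            view.remove_suffix(1);
            const auto comma = view.find(',');
            if (comma != std::string_view::npos) {  // malformed items are skipped
                values.push_back({std::string(view.substr(comma + 1)),
                                  toTenths(view.substr(0, comma))});
            }
        } else if (!item.empty()) {
            values.push_back({item, std::nullopt});
        }
        item.clear();
    };

    for (char c : list) {
        if (c == '(') ++depth;
        if (c == ')') --depth;
        if (c == ',' && depth == 0) {
            flush();
        } else {
            item += c;
        }
    }
    flush();
    return values;
}

// Version implied by a key/value pair: the later of the key's SinceVersion
// and any value-specific SinceVersion.
int requiredVersion(int keySince, const std::vector<PermittedValue>& values,
                    std::string_view used) {
    for (const PermittedValue& v : values) {
        if (v.name == used && v.since) return std::max(keySince, *v.since);
    }
    return keySince;
}
```

Applied to the WritingMode expression above with a key SinceVersion of 1.4 (14), requiredVersion returns 20 for TbLr and 14 for LrTb.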

Are the stated versions of real-world PDF files meaningful?

In 2022, during DARPA’s SafeDocs program, one million PDF files were randomly sampled from the CC-MAIN-2021-31-PDF-UNTRUNCATED corpus and analyzed to evaluate how reliably real-world PDF version information predicts the PDF features present in a file. The analysis used three tools: the Poppler pdfinfo command line utility to report the PDF version, Apache Tika, and the Arlington PDF Model TestGrammar application. Manual source code inspection confirmed that pdfinfo correctly processes both the PDF file header and the Document Catalog Version entry:

// Return the PDF version specified by the file (either header or catalog).
int getPDFMajorVersion() const { return std::max(headerPdfMajorVersion, catalog->getPDFMajorVersion()); }
int getPDFMinorVersion() const
{
    const int catalogMajorVersion = catalog->getPDFMajorVersion();
    if (catalogMajorVersion > headerPdfMajorVersion) {
        return catalog->getPDFMinorVersion();
    } else if (headerPdfMajorVersion > catalogMajorVersion) {
        return headerPdfMinorVersion;
    } else {
        return std::max(headerPdfMinorVersion, catalog->getPDFMinorVersion());
    }
}

The version of the Arlington PDF Model TestGrammar application used for this analysis reported only the PDF version associated with dictionary keys and array elements, and did not report against specific key values - thus almost certainly under-reporting inaccurate versioning. So, for instance, in the example above, all structure attribute dictionary WritingMode keys were recorded as PDF 1.4, even if the key value was TbLr, which was only introduced in PDF 2.0 and thus the PDF version of such a file should have been reported as PDF 2.0.

Of the one million PDF files surveyed, 4,711 files could not be processed by all 3 tools and were ignored, leaving 995,289 PDFs for which a self-identified version and a latest feature version could be reliably determined as shown below:

Rows: latest feature version. Columns: self-identified version.

Latest feature version	1.0	1.1	1.2	1.3	1.4	1.5	1.6	1.7	2.0
1.0	208	218	1,389	3,232	5,578	372	452	279	-
1.1	2	656	1,712	10,787	11,797	278	397	5,007	-
1.2	2	103	1,547	13,599	5,131	4,008	149	14,235	-
1.3	17	169	2,379	22,998	8,022	645	148	3,034	-
1.4	18	72	1,351	31,091	154,162	14,691	10,374	79,704	23
1.5	-	1	276	13,339	59,530	143,205	72,959	90,049	202
1.6	-	-	32	1,244	26,847	62,120	33,504	62,396	22
1.7	-	-	11	394	1,610	907	1,695	10,785	-
2.0	2	10	13	398	829	599	822	1,440	12

The self-identified versions were consistent between the 3 tools used, as all 3 tools implemented logic to process both the PDF file header and Document Catalog Version entries, and report the latest version.

Outcome	Count	Percent
PDF’s stated version equals the latest feature version (diagonal)	367,077	36.9%
PDF’s stated version is higher than the latest feature version (above diagonal)	420,889	42.3%
PDF’s stated version is lower than the latest feature version (below diagonal)	207,323	20.8%
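The three outcome counts follow directly from the matrix of results: the diagonal, the cells above it, and the cells below it. The short C++ sketch below reproduces that arithmetic. It is only an illustration (the names kCounts, Outcome, classify and percent are hypothetical), with the "-" cells entered as 0.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kVersions = 9;  // 1.0 through 1.7, then 2.0
using Matrix = std::array<std::array<long, kVersions>, kVersions>;

// Rows: latest feature version. Columns: self-identified version.
const Matrix kCounts = {{
    {208, 218, 1389, 3232, 5578, 372, 452, 279, 0},
    {2, 656, 1712, 10787, 11797, 278, 397, 5007, 0},
    {2, 103, 1547, 13599, 5131, 4008, 149, 14235, 0},
    {17, 169, 2379, 22998, 8022, 645, 148, 3034, 0},
    {18, 72, 1351, 31091, 154162, 14691, 10374, 79704, 23},
    {0, 1, 276, 13339, 59530, 143205, 72959, 90049, 202},
    {0, 0, 32, 1244, 26847, 62120, 33504, 62396, 22},
    {0, 0, 11, 394, 1610, 907, 1695, 10785, 0},
    {2, 10, 13, 398, 829, 599, 822, 1440, 12},
}};

struct Outcome {
    long equal = 0;         // stated version equals latest feature version
    long statedHigher = 0;  // stated version is higher (above the diagonal)
    long statedLower = 0;   // stated version is lower (below the diagonal)
};

Outcome classify(const Matrix& counts) {
    Outcome out;
    for (std::size_t feature = 0; feature < kVersions; ++feature) {
        for (std::size_t stated = 0; stated < kVersions; ++stated) {
            const long n = counts[feature][stated];
            if (stated == feature) {
                out.equal += n;
            } else if (stated > feature) {
                out.statedHigher += n;
            } else {
                out.statedLower += n;
            }
        }
    }
    return out;
}

double percent(long part, long total) {
    assert(total > 0);
    return 100.0 * static_cast<double>(part) / static_cast<double>(total);
}
```

Calling classify(kCounts) gives 367,077 equal, 420,889 stated-higher and 207,323 stated-lower files, which sum to the 995,289 files analyzed.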

Key finding: more than 20% had an “invalid” PDF version!

These results show that 79.2% of the approximately one million randomly sampled processable web-sourced PDFs had a stated version equal to or higher than the latest feature used, and thus consistent with the content of the PDF. Further, as noted above, this is a conservative value since key values were not version-checked - so potentially about one-quarter of real-world PDFs might contain features exceeding their stated version number!

Does the PDF version actually matter?

As our experience with everyday PDF files appears to be unhampered by these inaccurate version numbers, the answer, generally speaking, is “no”.

One important explanation for why the file’s stated version number lacks relevance is that most PDF features are designed to be backward-compatible. There is no need for general PDF software to fail on a document or warn users when it encounters a minor mis-versioned feature - either the software is coded to understand the feature or it ignores the effectively unknown feature(s) that it doesn’t understand. This is today’s experience with most PDF software.

Learning from the past

However, historically this was not always the case. In 2001, PDF 1.4 introduced the transparent imaging model, which “broke” the existing opaque imaging model PDF had inherited from PostScript. Although some implementations of the time attempted “quick fixes” that used the PDF version to determine which type of rendering technology to use, it soon became clear that this approach was unreliable. Over time all PDF software has come to implement the transparent imaging model; ISO 32000-2 even defines as normative its Annex Q “Method for determining transparency on a page” to avoid such issues.

Other significant PDF features introduced with different PDF versions, such as new image formats, compression and encryption algorithms, new annotation types, cross reference and object streams, etc., may have resulted in different behavior depending on whether one’s PDF software understood the new features. Historically, this lack of support ranged from treating the PDF file as completely unreadable to simply failing to correctly display the respective content:

PDF 1.3 ICC color profiles → incorrect color appearance due to approximated alternate color processing
PDF 1.4 transparent imaging model → wrong appearance if transparency is used with opaque-only rendering software
PDF 1.5 object streams and cross reference streams → cannot read the PDF file!
New filter types for stream objects
- If an image format → might leave a “hole” on the page where the image goes
  - PDF 1.4: JBIG2
  - PDF 1.5: JPEG 2000
- If general compression → cannot read stream objects including (potentially) content streams, images, fonts, cross reference streams, object streams, etc.
  - PDF 1.1: binary data (PDF 1.0 was a text format)
  - PDF 1.2: FLATE
  - “Compact PDF”: BZIP2
New encryption algorithms → cannot render as all streams are encrypted (including all content streams, images, fonts, cross reference streams, object streams, etc.)
- Encryption only impacts PDF strings and stream objects
New annotation types → might be visible or listed, but with no interaction
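As an illustration of reporting rather than rejecting, the following C++ sketch compares the stream filters a file uses against its stated version. It is a minimal example under stated assumptions: kFilterSince, FilterReport and checkFilters are hypothetical names, the table covers only the three filters whose versions are listed above (using their filter names FlateDecode, JBIG2Decode and JPXDecode), and versions are stored as major*10 + minor.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// First PDF version (major*10 + minor) of the filters named in the list above.
const std::map<std::string, int> kFilterSince = {
    {"FlateDecode", 12},  // PDF 1.2
    {"JBIG2Decode", 14},  // PDF 1.4
    {"JPXDecode", 15},    // PDF 1.5 (JPEG 2000)
};

struct FilterReport {
    std::vector<std::string> newerThanStated;  // introduced after the stated version
    std::vector<std::string> unrecognized;     // not present in this sketch's table
};

// Collects findings for the user instead of failing on the first mismatch.
FilterReport checkFilters(int statedVersion, const std::vector<std::string>& filtersUsed) {
    assert(statedVersion >= 10);  // PDF 1.0 is the earliest version
    FilterReport report;
    for (const std::string& name : filtersUsed) {
        const auto it = kFilterSince.find(name);
        if (it == kFilterSince.end()) {
            report.unrecognized.push_back(name);
        } else if (it->second > statedVersion) {
            report.newerThanStated.push_back(name);
        }
    }
    return report;
}
```

For a file stating PDF 1.3 that uses FlateDecode and JPXDecode, only JPXDecode is reported as newer than the stated version. Keeping unrecognized filters in a separate list lets software tell the user "this file uses something I do not understand" without treating the file as invalid.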

Clearly, the level of impact depends on the specific PDF feature and PDF software. When considering the user experience and the ubiquity of PDF it becomes obvious that blindly rejecting files based on their version is unacceptable in all but highly specialized workflows.

What is important, however, is that end users are informed when their PDF software encounters unknown PDF constructs that might have a bearing on their experience with that software (since not all PDF software necessarily renders pages). Even in the case of unknown encryption or stream filters, software can construct the PDF document object model (DOM) and certain classes of software may still be able to function (e.g. a page counting utility may not be impacted by unknown encryption unless that PDF also uses cross reference streams or object streams).

Looking forward

As we have previously described, PDF 2.0 relies on Extensions dictionaries to identify the latest set of PDF features, so simple version numbering is no longer sufficient. A valid PDF 2.0 file can contain features such as AES-GCM encryption, 3D STEP data, or new types of digital signatures but it needs to explicitly identify each of these PDF 2.0 extensions by using Extensions dictionaries since not all PDF 2.0 software might support these extensions. Additionally, ISO-standardized subsets of PDF such as PDF/A, PDF/X, PDF/VT, and PDF/UA, as well as PDF Declarations, are identified via document XMP metadata, so the industry needs to shift the importance that users have historically assigned to simple version numbers.
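One way software might act on Extensions dictionaries is to compare the extensions a file declares with those the software knows how to handle, and surface the difference to the user. The C++ sketch below is only an illustration: the Extension struct is a simplified stand-in modelled on the BaseVersion and ExtensionLevel entries, the names SupportedSet and unsupportedExtensions are hypothetical, and no real extension identifiers are implied.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Simplified view of one entry in a file's Extensions dictionary.
struct Extension {
    std::string prefix;       // developer prefix identifying the extension's owner
    std::string baseVersion;  // BaseVersion entry, e.g. "2.0"
    int extensionLevel;       // ExtensionLevel entry
};

// (prefix, level) pairs this hypothetical software understands.
using SupportedSet = std::set<std::pair<std::string, int>>;

// Returns the declared extensions that the software does not support, so
// they can be shown to the user rather than silently ignored.
std::vector<Extension> unsupportedExtensions(const std::vector<Extension>& declared,
                                             const SupportedSet& supported) {
    std::vector<Extension> missing;
    for (const Extension& ext : declared) {
        assert(!ext.prefix.empty());  // every extension must name its owner
        if (supported.count(std::make_pair(ext.prefix, ext.extensionLevel)) == 0) {
            missing.push_back(ext);
        }
    }
    return missing;
}
```

The sketch matches on an exact (prefix, level) pair rather than assuming that a higher level implies support for lower ones; whether that assumption holds depends on how each extension's owner assigns levels.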

The PDF Association has begun work on the next set of significant changes to the PDF Imaging Model which may again lead to various classes of “broken experiences” for users, but it is far too early to tell what kind of issues might result or even how versioning will be defined in PDF’s future. Those involved in this activity are already discussing how to minimize the impact of each feature, but there’s no doubt that moving PDF forward will require some degree of backward incompatibility.

Although regular and automatic software updates are more commonplace today (thanks to cybersecurity issues), there can be no guarantees that every user is always using the most up-to-date software - this is especially true in government and large enterprises where IT departments strictly control deployed software. Thus, today’s implementers need to be seriously thinking about how their software will cope with all kinds of unexpected inputs, and what their end-user experience might be when encountering PDFs from the future (or even today’s PDF with the latest and greatest feature sets!).

A simple test: what happens today when users open PDF 2.0 files that use various kinds of advanced ISO extensions? Is nothing indicated? Are error messages meaningful? Can users drill into error messages or obtain more information? Are holes left on pages or is the document truncated with no visual indication that content is missing? Can the user easily access comprehensible information derived from Extensions dictionaries or XMP metadata? The answer is highly unlikely to be one-size-fits-all or a simplistic pass/fail indicator; something more nuanced is needed.

Conclusion

It's perplexing that many PDF programs continue to emphasize the PDF version as a key piece of information even as analysis of real-world files reveals this focus is misplaced due to inaccuracies. Moreover, some PDF software ignores the document catalog version and only reports the version number found in the file's header, which can be very misleading. Our experiences also demonstrate that the majority of PDF software processes PDFs without relying on the version number; otherwise, we'd see many more failures. Despite this, there's a widespread misconception that the PDF version is critically important, even though most software ignores it.

From an end-user perspective, it’s important to know when a PDF document contains data that may result in an incorrect experience. Ignorance is not bliss when it comes to communication (or miscommunication!) via digital documents.

As the PDF industry moves forward with significant new features, and as new extensions and industry specifications become available, compliance of PDF files with their specifications is critical irrespective of simple PDF version numbers.

Much of this article was originally presented at a Digital Preservation Coalition (DPC) online Connect Session on 28 July 2023. Many thanks to Dr Tim Allison and the NASA JPL SafeDocs team for conducting this experiment.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).
