Parsing PDF’s dialects

Although PDF is commonly understood as a single holistic file format, in reality, PDF comprises a number of distinct dialects, each designed to support a specific aspect of the format.

About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise. A developer and researcher working on PDF technologies for more than … Read more

Peter Wyatt
October 30, 2023

Article

As a follow-on to my previous article on Perfecting PDF Lexical Analysis, this article discusses the importance of correctly parsing the “dialects” of PDF to ensure compliance with the PDF specification, and thus compatibility and interoperability between implementations. Just like the English dictionary definition, a “dialect” when parsing “is a (relatively small) variation or extension of the language that does not change its intrinsic nature” (Wikipedia). Although PDF is commonly understood and referred to as a single holistic file format, in reality, PDF comprises a number of distinct dialects, with each dialect designed to support a specific aspect of the format.

Software engineers love to reuse code as it makes developing software more efficient and the code base smaller, and easier to understand. The parser’s job is to take the stream of tokens from the lexical analyzer and validate it according to the precise set of rules for a specific dialect. Parsers generate data structures such as ‘parse trees’ or ‘abstract syntax trees’ (ASTs) for use by downstream software and must reject any input not conforming to the rules - what users might typically see as syntax error messages. Critically, it is the parser's job to protect the downstream software! If the parser’s validation is done incorrectly (such as using rules intended for some other dialect), then the downstream software can encounter unexpected data. Bad things can happen...

PDF’s specification implies some subtle parsing differences between the PDF body and content streams, between the PDF body and object streams, and between conventional cross-reference tables and cross-reference streams. Other PDF dialects are far more obvious - such as CMaps and Type 4 PostScript calculator functions.

The parser for the PDF body portion of a conventional PDF file expects to encounter direct objects between the keywords obj and endobj. It expects to encounter any number of basic PDF objects expressed in PDF’s COS syntax: integers, real numbers, booleans (true / false keywords), names, strings, arrays (demarcated by [ and ] tokens), dictionaries (demarcated by << and >> tokens), streams (using stream and endstream keywords), the special null object, and indirect references. This is what most people think of when they think of a PDF parser.

Parsing content streams

VSCode screenshot.

In contrast to the PDF body, PDF’s content streams are where graphical representations - the stuff that appears on the PDF page - are defined. Content streams also use COS syntax to define the graphic operators and their operands. These operands can be almost any of the basic PDF objects; the exceptions are other streams (obviously, by their very nature as streams must always be indirect objects!) and, more subtly, indirect references.

ISO 32000-2:2020 (PDF 2.0) clause 7.8.2 Content streams explicitly states: “Indirect objects and object references shall not be permitted at all”. The correct method to refer to an object is via named resources specified in a Resources dictionary. But there’s more…

A commonly asked question is “Can a content stream contain a PDF dictionary object? And, if so, where?”. The answer to the first question is a definitive yes! The only explicitly defined use of a direct PDF dictionary in a content stream is the property operand of certain marked-content operators as per Table 352 in ISO 32000-2:2020. Since the property dictionary is defined very flexibly, it may also contain nested direct PDF dictionaries. But this is not the only place a PDF dictionary might occur; we must also consider where PDF’s specification permits arbitrary objects in content streams, such as private data, as in this example:

/Span << 
   /MCID 27 
   /Lang (en-US) 
   /ACME_Private << /SomeDict << /Type /FooBar >> >> 
>> BDC
…
EDC

An inline image object is a dictionary-like data structure providing key/value pairs, but it does not start or end with << or >> (inline images use the BI and ID operators, respectively). However, the value of a private key in an inline image dictionary might be a direct PDF dictionary, including nested direct PDF dictionaries:

… 
BI
  /Width  20
  /Height 10
  /BPC  8
  /CS /DeviceRGB
  /ACME_Private << /SomeDict << /Type /FooBar >> >>
  /Filter [/FlateDecode] 
  /Length … 
ID
… binary FLATE-compressed image data … 
EI
…

PDF also defines the compatibility operators, BX and EX (see Table 33, ISO 32000-2:2020). Within a compatibility section, “unrecognised operators (along with their operands) shall be ignored without error until the balancing EX operator is encountered”. This means that a parser encountering a compatibility section needs to support encountering any basic PDF object including direct PDF dictionaries, except (as per clause 7.8.2) PDF stream objects and indirect references.

… 
BX
  << /SomeKey /SomeValue /InnerDict << /Key1 /Value1 >> >> newoperator 
EX
…

So if a PDF body parser also performs PDF content stream parsing then the following key differences in the respective dialects must be accounted for:

Content streams cannot have indirect references, and
Content streams can have nested direct PDF dictionaries in certain places, and
The PDF file body cannot have graphic operators/operands.

Parsing object streams

PDF 1.5 introduced Object streams (clause 7.5.7 in ISO 32000-2:2020) as an alternate method of storing indirect objects that would otherwise make up the file body of a PDF. By utilizing object streams with compression filters, it’s possible to reduce overall PDF file size. Although PDF’s specification prohibits certain PDF objects from being in object streams (including other streams, the document’s encryption dictionary, objects with non-zero generation numbers, the Length entry of an object stream, and several new requirements arising from errata to avoid potentially endless processing loops), parsing an object stream can be considered a dialect of the PDF file body. This parsing must also be able to enforce such restrictions on object streams as it is the job of the parser to protect downstream software, as mentioned above, from both accidental and malicious invalid inputs.

NOTE 7 (2020) Previous editions of this specification incorrectly stated that white-space was required between objects. This document corrects the text to clarify that processing of each object in an object stream starts at the specified byte offset in the decompressed stream and ends prior to the byte offset of the next object or when the end of stream is encountered.

NOTE 8 A compressed dictionary or array can contain indirect references.

An object in an object stream shall not consist solely of an object reference.

Object streams do not use the obj and endobj keywords, and instead use additional data (pairs of integers) at the start of the object stream that specifies the object number and starting byte offset for each object in the object stream. Consistent parsing of object streams by all PDF processors is the only way to ensure interoperability, which can only be achieved through a common understanding. The 2020 edition of ISO 32000-2 added NOTE 7 (shown above) to provide guidance on the correct way to partition object streams prior to parsing.

If you wish to test your PDF parser, targeted hand-crafted PDF files are freely available from GitHub. This repo includes additional test cases targeting other potential dialect confusion.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.

Featured articles

Discover pdfa.org

Key resources

Get involved

Parsing PDF’s dialects

Parsing content streams

Parsing object streams