Perfecting PDF Lexical Analysis
A lexical analyzer (or, more simply, a lexer) is the software deep inside every PDF parser that converts the input stream of bytes into the discrete tokens that make up the COS input grammar used by PDF. This involves correctly processing all PDF token delimiters and whitespace characters so that the logical sequence of tokens representing keywords, literals, identifiers and PDF objects can then be processed downstream by the parser.
If a lexer is not correct, a PDF parser will see either a meaningless jumble of tokens that it does not understand, so that PDF processing fails, or possibly a valid but entirely different sequence of tokens from the one the PDF writing software intended. It is therefore critically important that all lexers follow the PDF specification precisely!
The many snippets of PDF shown as examples throughout all ISO PDF specifications use generous quantities of whitespace, indenting and line breaks to aid readability and understanding by humans. Although this additional whitespace is entirely valid, the formal specification for PDF and the COS syntax does not require it. Thus a “compacted syntax” that is less human friendly but still entirely valid should be correctly parsed by all lexers (a lexer does not care about human friendliness!). This is mentioned explicitly in clause 7.2.3 of ISO 32000:
The examples in this document use a convention that arranges tokens into lines. However, the examples’ use of white-space for indentation is purely for clarity of exposition and need not be included in practical use.
Here is one of the first PDF snippets shown in ISO 32000-2 (the example of a page object below Table 31):
3 0 obj
  <</Type /Page
    /Parent 4 0 R
    /MediaBox [0 0 612 792]
    /Resources <</Font <</F3 7 0 R
                         /F5 9 0 R
                         /F7 11 0 R
                       >>
               >>
    /Contents 12 0 R
    /Annots [23 0 R
             24 0 R
            ]
  >>
endobj
And here is what the same snippet would look like if all unnecessary whitespace (which includes line breaks) were removed:
3 0 obj<</Type/Page/Parent 4 0 R/MediaBox[0 0 612 792]/Resources<</Font<</F3 7 0 R/F5 9 0 R/F7 11 0 R>>>>/Contents 12 0 R/Annots[23 0 R 24 0 R]>>endobj
This may be a big difference for a human, but it is no different for a correctly implemented PDF lexer!
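The equivalence of the two forms can be checked with a small tokenizer. The following is a minimal sketch only, not a conforming PDF lexer: the name `tokenize` and the constants are illustrative assumptions, tokens are returned as raw byte strings, and streams, name escapes (`#xx`) and object construction are not handled.

```python
WHITESPACE = b"\x00\t\n\x0c\r "   # NUL, HT, LF, FF, CR, SP
DELIMITERS = b"()<>[]/%"          # general PDF delimiters (no curly braces)
STOP = WHITESPACE + DELIMITERS    # any of these ends a name or regular token


def tokenize(data: bytes) -> list[bytes]:
    """Split COS syntax into raw tokens (illustrative sketch only)."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in WHITESPACE:
            i += 1
        elif data[i:i + 2] in (b"<<", b">>"):      # dictionary delimiters
            tokens.append(data[i:i + 2])
            i += 2
        elif c in b"[]":                           # array delimiters
            tokens.append(c)
            i += 1
        elif c == b"<":                            # hex string up to '>'
            end = data.index(b">", i)
            tokens.append(data[i:end + 1])
            i = end + 1
        elif c == b"(":                            # literal string
            depth, j = 0, i
            while True:
                if j >= n:
                    raise ValueError("unterminated literal string")
                ch = data[j:j + 1]
                if ch == b"\\":
                    j += 1                         # skip the escaped byte
                elif ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            tokens.append(data[i:j + 1])
            i = j + 1
        elif c == b"%":                            # comment: skip to end of line
            while i < n and data[i:i + 1] not in b"\r\n":
                i += 1
        elif c == b"/" or c not in DELIMITERS:     # name, number or keyword
            j = i + 1
            while j < n and data[j:j + 1] not in STOP:
                j += 1
            tokens.append(data[i:j])
            i = j
        else:
            raise ValueError(f"unexpected delimiter {c!r} at offset {i}")
    return tokens


SPACED = (
    b"3 0 obj\n"
    b"  <</Type /Page\n"
    b"    /Parent 4 0 R\n"
    b"    /MediaBox [0 0 612 792]\n"
    b"    /Resources <</Font <</F3 7 0 R /F5 9 0 R /F7 11 0 R >> >>\n"
    b"    /Contents 12 0 R\n"
    b"    /Annots [23 0 R 24 0 R ]\n"
    b"  >>\n"
    b"endobj\n"
)
COMPACT = (
    b"3 0 obj<</Type/Page/Parent 4 0 R/MediaBox[0 0 612 792]"
    b"/Resources<</Font<</F3 7 0 R/F5 9 0 R/F7 11 0 R>>>>"
    b"/Contents 12 0 R/Annots[23 0 R 24 0 R]>>endobj"
)

print(tokenize(SPACED) == tokenize(COMPACT))  # prints True
```

Both inputs yield the same token sequence because every removed whitespace character sat next to a delimiter such as `/`, `[`, `]`, `<<` or `>>`.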
Over the years, the wording of the Character set clause (clause 7.2.2 in ISO 32000-1:2008, and clause 7.2.3 in ISO 32000-2:2017 and ISO 32000-2:2020) has been subtly revised. Fundamentally there have been no changes to the basic lexical rules of PDF, only corrections (such as recognizing that LEFT CURLY BRACE and RIGHT CURLY BRACE are not token delimiters in general PDF, but a hangover from PostScript) and clarifications (such as improved cross-referencing to other clauses and highlighting the double character constructs << and >> used by dictionaries).
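As a rough illustration of the character classes this clause defines, the sketch below sorts a byte into whitespace, delimiter or regular. The names `WHITESPACE`, `DELIMITERS` and `classify` are assumptions made for this example, and the curly braces are deliberately treated as regular characters, following the correction described above for general PDF.

```python
# The six PDF whitespace characters: NUL, HT, LF, FF, CR, SP.
WHITESPACE = frozenset(b"\x00\t\n\x0c\r ")

# Delimiters in general PDF. '{' and '}' are intentionally absent
# (assumption of this sketch: we are not inside a PostScript function).
DELIMITERS = frozenset(b"()<>[]/%")


def classify(byte: int) -> str:
    """Return the lexical class of a single byte value (0-255)."""
    if byte in WHITESPACE:
        return "whitespace"
    if byte in DELIMITERS:
        return "delimiter"
    return "regular"


for ch in b"/ {]":
    print(repr(chr(ch)), classify(ch))
```

A lexer built on such a table only has to decide, byte by byte, whether the current token continues or ends.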
When considered logically and based on a detailed reading of the full PDF specification, any adjacent pair of PDF tokens may or may not require whitespace as the token delimiter. This will depend on the specific rules for the characters (bytes) that comprise each token, and whether a specific delimiter character can be included or not.
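One simplified way to express this dependency: whitespace is needed only when the last byte of the first token and the first byte of the second token are both regular characters, because otherwise a delimiter already marks the boundary. The sketch below assumes exactly that rule; `needs_whitespace` is a hypothetical helper, and context-dependent cases (for example the end-of-line required after the `stream` keyword, or comments) are deliberately ignored.

```python
WHITESPACE = frozenset(b"\x00\t\n\x0c\r ")
DELIMITERS = frozenset(b"()<>[]/%")


def is_regular(byte: int) -> bool:
    """A regular character is neither whitespace nor a delimiter."""
    return byte not in WHITESPACE and byte not in DELIMITERS


def needs_whitespace(first: bytes, second: bytes) -> bool:
    """True if two adjacent tokens would merge without a separator.

    Simplified rule (assumption): only the touching bytes matter.
    """
    return is_regular(first[-1]) and is_regular(second[0])


pairs = [
    (b"/Type", b"/XObject"),  # '/' starts the second token
    (b"0", b"R"),             # would merge into the single token 0R
    (b"/F3", b"7"),           # would merge into the name /F37
    (b"]", b"-1.2"),          # ']' ends the first token
    (b"R", b">>"),            # '>' starts the second token
]
for first, second in pairs:
    print(first, second, needs_whitespace(first, second))
```

Under this rule only the second and third pairs require a separator; the others are already unambiguous when written with no gap.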
A key point is that PDF does NOT always require whitespace between tokens and that various special characters are defined to be explicit token delimiters. It is common to encounter some PDF objects without whitespace in real-world PDF files:
- adjacent PDF name objects with no gap between them: /Type/XObject;
- PDF dictionaries written as direct objects: <</SomeKey<</A/B>>>>;
- directly nested PDF arrays: [[/A/B][/C/D]];
- adjacent PDF hex strings: <0a><0d>; and
- adjacent PDF literal strings: (cat)(sat)(mat).
However, other characters are often not handled correctly as token delimiters, especially when their role depends on the preceding token. This can lead to parser differentials, non-interoperable PDFs or, worse, parser crashes.
With support from the DARPA-funded SafeDocs program, the PDF Association CTO, Peter Wyatt, has created a detailed test matrix of all possible combinations of PDF token pairings and an associated test PDF file. The test PDF file tests parsing of valid combinations of adjacent PDF tokens, both in the body of a PDF and in PDF content streams, and is available in the PDF Association GitHub repository.
The matrix document describes the 121 possible token pairings. It also illustrates that in some cases several different characters (bytes) can begin a token, creating even more combinations of adjacent character pairings. For example, a real number can start with a sign, - or + (e.g., a negative number such as -1.2), a US-style decimal point . (as in .123 without any leading zero), or any of the digits 0 to 9 (such as the usual 4.56). Thus there are 3 test cases for when a PDF real number is the second token: a leading sign, a leading decimal point and a leading digit. This creates multiple adjacent compacted character pairings that are all valid token pairings after, for example, a PDF array end token (]):
- ]-1.2,
- ]+1.2,
- ].12, and
- ]4.56.
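The pairings above can be exercised with a short sketch. It assumes a regular expression, `NUMBER`, that approximates PDF numeric syntax (optional sign, digits with an optional decimal point, or a leading decimal point); the helper `lex_after_array_end` is hypothetical and only covers this one pairing.

```python
import re

# Approximation of a PDF number: optional sign, then either
# digits with an optional fraction ("4", "4.", "4.56") or a
# leading decimal point followed by digits (".12").
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def lex_after_array_end(data: bytes) -> tuple[bytes, float]:
    """Split ']' followed immediately by a number into two tokens."""
    if not data.startswith(b"]"):
        raise ValueError("expected an array end token")
    match = NUMBER.match(data, 1)  # start matching right after ']'
    if match is None:
        raise ValueError("expected a number after ']'")
    return b"]", float(match.group().decode("ascii"))


for pairing in (b"]-1.2", b"]+1.2", b"].12", b"]4.56"):
    print(pairing, lex_after_array_end(pairing))
```

In each case the array end token is recognized on its own and the sign, decimal point or digit starts a new numeric token, with no whitespace in between.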
Note that a correct rendering (visual appearance) of the test PDF does NOT guarantee that a PDF processor correctly processes tokens according to all the PDF rules! An analysis of the parsed token stream and confirmation of the constructed PDF objects against the test PDF file is required. This requires a software engineer familiar with the lexer/parser source code to perform this test.
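Such a check might look like the sketch below, which compares the tokens a lexer produced with the tokens expected for one test case and reports the first difference. Everything here is an assumption for illustration: `first_mismatch` is a hypothetical helper, and the "actual" lists stand in for the output of whatever lexer is under test.

```python
from itertools import zip_longest


def first_mismatch(expected, actual):
    """Return (index, expected_token, actual_token) for the first
    difference between two token lists, or None if they are equal.
    A missing token on either side is reported as None."""
    for index, (want, got) in enumerate(zip_longest(expected, actual)):
        if want != got:
            return index, want, got
    return None


expected = [b"]", b"-1.2"]

# A correct lexer splits the compacted pairing into two tokens.
print(first_mismatch(expected, [b"]", b"-1.2"]))

# A faulty lexer (hypothetical) that glued the two tokens together.
print(first_mismatch(expected, [b"]-1.2"]))
```

Reporting the index of the first divergence points the engineer at the exact token pairing that the lexer mishandled, which a visual check of the rendered page cannot do.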
By ensuring that all PDF lexers are fully and correctly implemented against the PDF standard, document interoperability, reliability and product robustness are improved. And everyone’s a winner.
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.