The smallest possible (valid) PDF

What’s the smallest possible fully compliant and validatable PDF file? We set out to answer the question.
About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise. A developer and researcher working on PDF technologies for more than …
Peter Wyatt
January 24, 2025


Over the years we’ve seen various challenges to create the smallest possible valid PDF file. Examples include Mathias Bynens’ experiments, this closed StackOverflow thread, and Ange Albertini’s tweets from 2017. However, such challenges have tended to focus on specific implementations and/or exploit the way those implementations recognize (and accept) malformed input data as valid PDF.

BUSINESS NOTE

Value is in the eye of the beholder! PDF can be fun, so this one is for the geeks!

Common tricks include:

  • Not including dictionary keys that are specified as required (e.g. Length keys on streams; page Resources).
  • Not including any cross-reference information.
  • Using inline dictionaries instead of the required indirect references (and thereby reducing the object count).
  • Not including required elements of PDF file structure (such as the startxref keyword, %%EOF marker or required end-of-line markers).
  • Relying on permissive and - let’s face it - bad implementations.
  • Use of abbreviations for some key names (such as compression filter names).

These tiny but malformed files often won’t work in stricter PDF applications that choose to ignore such malformed inputs (or are simply incapable of handling these errors).

For this article, we took a slightly different approach. We focused on creating the smallest possible fully compliant and validatable PDF file that works with any PDF software that complies with the latest definition of PDF (ISO 32000-2:2020).

So what “tricks” help with minimizing file size while maintaining strict compatibility with the specification?

  • Always using a single-byte end-of-line (EOL) marker: either CARRIAGE RETURN (0Dh) or LINE FEED (0Ah).
  • Avoiding whitespace wherever possible, since adjacent tokens can leverage the special delimiter characters defined in ISO 32000-2, 7.2.3.
  • Using integers instead of real numbers (there is no need for a PERIOD or any decimal digits).
  • Using small integers (fewer digits) without leading zeros or a sign character.
  • Avoiding optional keys in dictionaries.
  • Using default values, as defaults do not need to be explicit.
  • Eliminating comments that are not required by the PDF specification.
  • Keeping objects inline as much as possible, since each new object also requires a cross-reference table entry (a 20-byte overhead for conventional cross-reference entries).
  • Leveraging the benefits of each PDF version (but without violating any PDF versioning rules!).
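
The whitespace rule in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a PDF serializer: the token list is a hypothetical catalog dictionary, the helper names `needs_space` and `pack` are invented here, and the delimiter set is a simplified reading of ISO 32000-2, 7.2.3.

```python
# Simplified delimiter set (assumption: braces and other special cases are ignored).
DELIMITERS = b"()<>[]/%"

def needs_space(left: bytes, right: bytes) -> bool:
    """Whitespace is only needed when neither adjacent byte is a delimiter."""
    return left[-1] not in DELIMITERS and right[0] not in DELIMITERS

def pack(tokens: list[bytes]) -> bytes:
    """Join tokens, inserting a single space only where it is required."""
    out = tokens[0]
    for token in tokens[1:]:
        out += (b" " if needs_space(out, token) else b"") + token
    return out

# Hypothetical catalog dictionary, already split into tokens.
tokens = [b"<<", b"/Type", b"/Catalog", b"/Pages", b"2", b"0", b"R", b">>"]
padded = b"<< /Type /Catalog /Pages 2 0 R >>"
packed = pack(tokens)

print(packed, len(padded) - len(packed))
```

Only the spaces around the indirect reference survive, because a digit or the R keyword next to another regular character would otherwise merge into a single token.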

The smallest possible valid PDF

NOTE: All links to PDFs in this article are download links

Both these PDFs are text-based and can be viewed in a text editor (such as Visual Studio Code with the PDF Association’s extension for PDF syntax). Depending on other editors used, Microsoft Windows users may have issues since Windows uses a CR/LF pair for EOL.

PDF developers know that PDF 1.5 introduced object streams and cross-reference streams with the explicit goal of reducing file size – but does this hold true for such a trivial file? Object streams no longer use “<object number><generation number> obj” and “endobj” keywords, but have an opportunity cost of object numbers and a relative byte offset (see ISO 32000-2:2020, 7.5.7). Cross-reference streams replace the conventional 20-byte cross-reference section entries with binary data so have the potential for much greater savings (see ISO 32000-2:2020, 7.5.8), although the opportunity cost is now additional dictionary entries in the cross-reference stream dictionary.

However, when adding binary data to PDF, ISO 32000-2:2020, 7.5.2 requires the addition of a binary file marker comment:

If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.

The PDF specification is not entirely clear on how “binary” is defined: whether it includes any unprintable ASCII control character or just bytes that are 128 or greater in value. For this exercise, we can be conservative and include the marker comment whenever an unusual ASCII control character is present. This comment will always be 6 bytes: the % start-of-comment marker plus the 4 binary bytes and a single EOL byte to end the comment. However, this also has the unfortunate consequence that the startxref offset now needs 2 digits, as the first object is shifted further into the file!
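
The byte arithmetic can be checked with a short sketch. It assumes a PDF 1.5 header line ending in a single LINE FEED and an arbitrary choice of four bytes with values of 128 or greater; the actual bytes used in the article's files may differ.

```python
HEADER = b"%PDF-1.5\n"            # 9 bytes, single-byte EOL
MARKER = b"%\x80\x80\x80\x80\n"   # hypothetical marker: % + 4 binary bytes + EOL

def first_object_offset(with_marker: bool) -> int:
    """Byte offset of whatever immediately follows the header lines."""
    return len(HEADER) + (len(MARKER) if with_marker else 0)

for with_marker in (False, True):
    offset = first_object_offset(with_marker)
    print(with_marker, offset, len(str(offset)))
```

Without the marker the first object can start at offset 9, a single digit. With the 6-byte marker it starts at offset 15, so a startxref value pointing at it needs two digits.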

The smallest-possible-pdf-1.5.pdf (PDF 1.5) uses both an uncompressed (raw bytes) object stream and an uncompressed cross-reference stream - and is 342 bytes, which is 30 bytes larger than the PDF 1.0 example with the equivalent PDF objects.

NOTE: some PDF software does not support uncompressed object streams and/or uncompressed cross-reference streams. This is unfortunate as the PDF specification does not require compression; it is far easier to learn and debug PDF when working with uncompressed (and thus, human-readable) data.

Let’s see if using a cross-reference stream will be less than the 89 bytes of a conventional cross-reference section – a significant contributor to total file size. This will require a few changes to remain 100% valid:

  • The PDF version must be increased to PDF 1.5 or later - but if a later PDF version is used then all additional requirements introduced in that version must also be met.
  • A new object must be added for the cross-reference stream itself.
  • The conventional trailer dictionary entries must be moved to the cross-reference stream dictionary along with all the required cross-reference stream dictionary keys.
  • The startxref byte offset is changed to point to the new cross-reference stream. Ensuring this stream is at the start of the file means a smaller number, and thus fewer digits.
  • The cross-reference stream data can utilize default values in the W array, since all objects have a generation number of 0 (ISO 32000-2:2020, Table 17, third value in the W array).
  • If all PDF objects start in the first 255 bytes the cross-reference stream data can use a single byte for file offsets (the middle entry in the W array).
  • ISO 32000-2:2020 does not explicitly state whether PDF files without incremental updates and that use cross-reference streams are required to explicitly initialize the free list of objects or not (i.e. have an equivalent of the first line in conventional cross-reference tables: “0000000000 65535 f”). Such initialization requires the third entry in the W array to be 2 to represent 65,535 (FFFFh), but this approach results in excessive opportunity cost, as all other objects must then have a generation number when they could otherwise use the default.
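
To make the size trade-off in the list above concrete, here is a rough Python sketch comparing a conventional cross-reference section with raw cross-reference stream data. The byte offsets are invented for illustration, the function names are hypothetical, and only the table data is counted (not the stream dictionary or the trailer).

```python
def conventional_entry(offset: int, generation: int, kind: str) -> bytes:
    """One 20-byte entry: 10-digit offset, 5-digit generation, type, 2-byte EOL."""
    return f"{offset:010d} {generation:05d} {kind} \n".encode("ascii")

def pack_xref_stream(entries, widths) -> bytes:
    """Pack (type, field2, field3) rows as big-endian fields; width 0 omits a field."""
    out = bytearray()
    for fields in entries:
        for value, width in zip(fields, widths):
            if width:
                out += value.to_bytes(width, "big")
    return bytes(out)

# Conventional section for object 0 plus 3 objects (offsets are hypothetical).
section = b"xref\n0 4\n" + conventional_entry(0, 65535, "f")
for offset in (9, 50, 100):
    section += conventional_entry(offset, 0, "n")

# Stream data for object 0 plus 4 objects (the extra one is the stream itself).
rows = [(0, 0, 65535), (1, 15, 0), (1, 60, 0), (1, 110, 0), (1, 160, 0)]
with_free_list = pack_xref_stream(rows, (1, 1, 2))   # third field holds FFFFh
with_defaults = pack_xref_stream(rows, (1, 1, 0))    # third field left to defaults

print(len(section), len(with_free_list), len(with_defaults))
```

Under these assumptions the conventional section takes 89 bytes, while the stream data takes 20 bytes with a 2-byte generation field or 10 bytes when the third field is omitted. The stream dictionary and the extra object wrapper eat into that difference, which is why the overall saving is much smaller.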

So the experiment worked! The smallest-possible-pdf-1.5-xrefstm-only.pdf is now only 294 bytes long, a saving of 18 bytes! And this is without using any compression, so maybe we can do even better…!

NOTE: some text editors, including Visual Studio, do not support arbitrary binary data such as occurs in PDFs with cross-reference streams. We suggest using a hex editor (such as HxD or 010 Editor) to ensure the raw binary bytes are not accidentally transformed. The cross-reference table data can be understood using (for example) the QPDF command line utility: qpdf --show-xref.

TIP: using the ASCIIHexDecode filter on cross-reference stream data can make understanding and editing much easier, since the (decimal) byte offsets only need to be converted to/from hex, which can then be easily edited using a text editor (assuming no other binary data is in the PDF). The use of whitespace and EOLs in ASCIIHexDecode also allows for a neat layout (e.g. one line of whitespace-separated hex fields per PDF object). However, since using ASCIIHexDecode at least doubles the number of bytes required, this does not help us in discovering the smallest possible valid PDF file!
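
As a rough illustration of that layout, the following Python sketch writes cross-reference stream rows as hex text with one line per object and a closing > end-of-data marker, then decodes it back. The rows, the field widths, and the helper names are assumptions made for this example.

```python
def ascii_hex_rows(rows, widths) -> bytes:
    """One line per object, one whitespace-separated hex group per field."""
    lines = []
    for fields in rows:
        groups = [value.to_bytes(width, "big").hex().upper()
                  for value, width in zip(fields, widths)]
        lines.append(" ".join(groups))
    return ("\n".join(lines) + ">").encode("ascii")   # ">" is the EOD marker

def ascii_hex_decode(data: bytes) -> bytes:
    """Drop the EOD marker; bytes.fromhex skips ASCII whitespace."""
    return bytes.fromhex(data.decode("ascii").rstrip(">"))

# Hypothetical entries: object 0 (free) and one in-use object at offset 15.
rows = [(0, 0, 65535), (1, 15, 0)]
widths = (1, 1, 2)

encoded = ascii_hex_rows(rows, widths)
raw = ascii_hex_decode(encoded)
print(encoded.decode("ascii"))
print(len(raw), len(encoded))
```

Here 8 raw bytes become 22 bytes of hex text, which is readable and editable but clearly not a route to a smaller file.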

When compressing stream data the specification requires that a new Filter key be added to each stream dictionary with the full filter name. Our best choices are either FlateDecode (“FLATE”) or LZWDecode (“LZW”). Although we might consider choosing LZWDecode because its name is 2 bytes shorter, using FLATE gives us slightly better data compression, even though it requires adding 19 bytes (/Filter/FlateDecode)! Any compression would need to save 20 (or more) bytes versus the raw streams used in the smallest-possible-pdf-1.5.pdf to result in any size advantage. As our file’s XRef stream is only 24 bytes in raw form, this simply cannot happen. But our uncompressed object stream is 136 bytes…

Using the zlib-flate command line utility (part of QPDF) with maximum compression it is possible to extract and compress just the 136 bytes of object stream data:

zlib-flate -compress=9 < objects.txt  > objects.flate

The resulting FLATE-compressed data is 113 bytes – saving 23 bytes, or 4 bytes once the 19-byte /Filter/FlateDecode entry is added back when merging into a PDF. The smallest-possible-pdf-1.5-flate.pdf is now just 338 bytes, which is 4 bytes smaller than when using an uncompressed object stream!
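
The break-even reasoning can be sketched in Python using the standard zlib module, which produces the same kind of zlib-wrapped deflate data that FlateDecode expects. The helper names are hypothetical, and the exact compressed size of any given input depends on the zlib build, so only the arithmetic is asserted here.

```python
import zlib

FILTER_ENTRY = b"/Filter/FlateDecode"   # 19 bytes added to the stream dictionary

def flate(data: bytes) -> bytes:
    """Compress at the maximum level, comparable to -compress=9."""
    return zlib.compress(data, 9)

def net_saving(raw_len: int, compressed_len: int) -> int:
    """Bytes saved in the file after paying for the Filter entry.

    Also accounts for the Length value needing fewer (or more) digits.
    """
    length_digits = len(str(raw_len)) - len(str(compressed_len))
    return (raw_len - compressed_len) + length_digits - len(FILTER_ENTRY)

print(net_saving(136, 113))   # object stream sizes quoted in the text

# A 24-byte stream cannot win: the zlib wrapper alone costs several bytes.
tiny = flate(bytes(24))
print(len(tiny), net_saving(24, len(tiny)))
```

With the sizes quoted above, `net_saving(136, 113)` gives 4, matching the drop from 342 to 338 bytes. For a 24-byte stream the result is negative no matter how compressible the data is.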

If the PDF 2.0 overhead of the trailer ID entry was added as a key to the XRef dictionary, it would add 41 bytes, since all bytes would be stored in object 1 outside the object stream. However, since these “nano-PDFs” are not encrypted, ISO 32000-2:2020 Table 15 permits the ID entry to be an indirect reference, allowing for a new object with the 2-element array of strings to be located within the object stream. Obviously, if the object stream is uncompressed this approach adds significant overhead for a new object (e.g. another cross-reference entry) plus an indirect reference (<object number> <generation number> R) – see the smallest-possible-pdf-2.0-stms.pdf at a “massive” 406 bytes.

However, if the same FLATE compression technique is used on the object stream data, the data reduces from 180 to 136 bytes and, when merged back into the PDF, the overall file size is 381 bytes, which is larger than the conventionally structured PDF 2.0 file (see the smallest-possible-pdf-2.0.pdf vs. the smallest-possible-pdf-2.0-stms.pdf). Note that because of the new object, the raw cross-reference stream increased, but insufficiently to bother with a compression filter given the Filter name overhead. Download the files individually from the table, or in a zip archive.

Filename | Size (bytes) | PDF version | Features
smallest-possible-pdf-1.0.pdf | 312 | 1.0 | Pure text. 3 objects. Conventional cross-reference.
smallest-possible-pdf-1.5.pdf | 342 | 1.5 | Binary. 5 objects. Raw cross-reference stream. Raw object stream.
smallest-possible-pdf-1.5-flate.pdf | 338 | 1.5 | Binary. 5 objects. Raw cross-reference stream. FLATE compressed object stream.
smallest-possible-pdf-1.5-xrefstm-only.pdf | 294 | 1.5 | Binary. 4 objects. Raw cross-reference stream. Conventional body objects.
smallest-possible-pdf-2.0.pdf | 353 | 2.0 | Pure text. 3 objects. Conventional cross-reference. Trailer ID entry.
smallest-possible-pdf-2.0-stms.pdf | 406 | 2.0 | Binary. 6 objects. Raw cross-reference stream. Raw object stream. ID entry (indirect).
smallest-possible-pdf-2.0-stms-flate.pdf | 381 | 2.0 | Binary. 6 objects. Raw cross-reference stream. FLATE compressed object stream. ID entry (indirect).

These PDF files can be validated using various tools including:

The smallest possible fully valid file we achieved was smallest-possible-pdf-1.5-xrefstm-only.pdf at just 294 bytes, without using any compression filters and avoiding object streams. This is a conservative outcome due to ambiguities in the PDF specification. What was surprising to us is that the next smallest PDF (smallest-possible-pdf-1.0.pdf at 312 bytes) was “much” larger (+18 bytes). Obviously real-world files will achieve far greater size benefits by using object streams, cross-reference streams, and compression filters, but it was fun to see how small things can be while ensuring validity.


Ange Albertini’s tweetstorm exposing lax implementations!
