Arlington PDF Model
The Arlington PDF Model is a free and open-source machine-readable data model of all PDF objects. It was originally developed under the DARPA-funded SafeDocs program and is now available via GitHub. As of June 2023 it remains under active development.
Derived directly from the latest PDF 2.0 specification, ISO 32000-2:2020, and its resolved errata, the Arlington PDF Model is entirely vendor- and implementation-neutral. It represents the full PDF document object model (DOM) as an easy-to-process structured definition: a set of tab-separated values (TSV) files covering all formally defined PDF objects (dictionaries, arrays and map objects) and their relationships, beginning with the PDF file trailer. The definition uses a simple text-based syntax and a small set of predicates (declarative statements about data integrity relationships).
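To make the idea of a tab-separated object definition concrete, here is a minimal Python sketch of how such a file might be read and queried. The column names (`Key`, `Type`, `Required`, `Link`), the sample rows and the helper functions are illustrative assumptions for this example only. They are not copied from the model, so consult the GitHub repository for the actual file layout and predicate syntax.

```python
import csv
import io

# Hypothetical excerpt in the spirit of a per-object TSV definition.
# Column names and rows are assumptions made for illustration only.
SAMPLE_TSV = (
    "Key\tType\tRequired\tLink\n"
    "Size\tinteger\tTRUE\t\n"
    "Root\tdictionary\tTRUE\t[Catalog]\n"
    "Info\tdictionary\tFALSE\t[DocInfo]\n"
)


def load_object_definition(tsv_text):
    """Parse one tab-separated object definition into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def required_keys(rows):
    """Return the keys whose (assumed) Required column is the literal TRUE."""
    return [row["Key"] for row in rows if row["Required"] == "TRUE"]


def linked_objects(rows):
    """Map each key to the names of the object definitions it links to."""
    links = {}
    for row in rows:
        targets = row["Link"].strip("[]")
        if targets:
            links[row["Key"]] = targets.split(",")
    return links


rows = load_object_definition(SAMPLE_TSV)
print(required_keys(rows))
print(linked_objects(rows))
```

Following the linked names from one definition to the next, starting at the trailer, is one way a tool could walk the relationships between objects that the model describes.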
The Arlington PDF Model does not replace ISO 32000-2; it must always be used in conjunction with the PDF 2.0 specification to fully understand the PDF DOM.
This resource is primarily intended for PDF developers, helping them check their implementations and understanding, avoid malformed PDFs, and bring PDF technology into alignment with the most up-to-date definition of PDF.
The model was named “Arlington” in recognition of DARPA’s contribution to advancing PDF technology (DARPA is headquartered in Arlington, Virginia).
Related resources
- Arlington PDF Model on GitHub
- Demystifying PDF through a machine-readable definition; Peter Wyatt, LangSec Workshop at IEEE Security & Privacy, May 27th and 28th, 2021 [Paper] [Talk Video]
- “The Arlington PDF Model” [presentation], Peter Wyatt, PDF Association’s “PDF Days 2021” online event, Tuesday 28 Sept 2021.
- “Strategies for Testing PDF Files”, PDF Days Europe 2022, Michael Demey, iText Group NV.
- BFO PDF Library 2.27.2 – introducing the Arlington Model, blog post, 5 Dec 2022. Mike Bremford, BFO.
- DARPA SafeDocs: an approach to secure parsing and information interchange formats, [30 minute presentation Video] Sergey Bratus, Microsoft Research Summit 2021, 20 October 2021.
- veraPDF Arlington-based PDF checker development preview
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.