Presented at PDF Days Europe 2022
(September 2022)

Making more sense of PDF structures in the wild at scale

Progress and outcomes of analysis on 8 million PDFs gathered from Common Crawl

Session description

This talk follows on from our 2021 PDF Days presentation on the File Observatory. Our team built the File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program by enabling parser developers to understand the features of PDFs in the wild at scale. In the first part of our presentation, we’ll give an overview of the observatory’s capabilities, from gathering files, to running numerous parsers on them, to searching and analyzing the features the parsers extract. In the second part, we’ll detail progress on building and packaging the “observatory in a box” for transition. In the third part, we’ll present findings from an analysis of roughly 8 million PDFs from Common Crawl. This part will cover parser warnings, exceptions and errors across the set of files, as well as statistical summaries of PDF features, including versions, languages, creator tools/producers and more interesting syntactic features.

Tim Allison
Jet Propulsion Laboratory

Slides download: https://pdfa.org/wp-content/uploads/2022/05/1415-Allison.pdf
