Presented at PDF Days Europe 2022 (September 2022)

Making more sense of PDF structures in the wild at scale

Progress and outcomes of analysis on 8 million PDFs gathered from Common Crawl

About the presenter(s)

Tim has been working in content/metadata extraction (and evaluation), advanced search, and relevance tuning for nearly 20 years. Tim is the founder of Rhapsode Consulting LLC, and he currently works …


Tim Allison
Jet Propulsion Laboratory

Description

This is a follow-on talk from our 2021 PDF Days presentation on the File Observatory. Our team built the File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program by enabling parser developers to understand features of PDFs in the wild at scale. In the first part of our presentation, we’ll offer an overview of the capabilities of the observatory, from gathering files, to running numerous parsers on them, to searching and analyzing the features the parsers extract. In the second part, we’ll detail progress on building and packaging the “observatory in a box” for transition. In the third part, we’ll present some of the findings from an analysis of roughly 8 million PDFs from Common Crawl. This part will include an analysis of parser warnings, exceptions, and errors on the set of files, as well as statistical summaries of PDF features, including versions, languages, creator tools/producers, and more interesting syntactic features.
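
As a rough illustration of the kind of statistical summary mentioned above, the sketch below tallies the version declared in each file's `%PDF-x.y` header. It is not the File Observatory's code: the function names `pdf_version` and `tally_versions`, the 1024-byte search window, and the sample byte strings are all assumptions made for this example. A real analysis would also need to account for the document catalog's `/Version` entry, which can override the header.

```python
import re
from collections import Counter

# A PDF normally starts with a header such as "%PDF-1.7". Files in the wild
# sometimes carry junk bytes before it, so we search a leading window rather
# than requiring the header at offset 0.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version(data: bytes, window: int = 1024) -> str:
    """Return the header version found in the first `window` bytes, or 'unknown'."""
    match = HEADER_RE.search(data[:window])
    return match.group(1).decode("ascii") if match else "unknown"


def tally_versions(blobs) -> Counter:
    """Count header versions across an iterable of raw file contents."""
    return Counter(pdf_version(blob) for blob in blobs)


if __name__ == "__main__":
    # Hypothetical sample data standing in for files read from disk.
    samples = [b"%PDF-1.4\n%...", b"junk%PDF-1.7\n", b"%PDF-1.4\n", b"not a pdf"]
    for version, count in tally_versions(samples).most_common():
        print(version, count)
```

Searching a window instead of anchoring at the first byte is a deliberate choice here, since tolerating leading junk is exactly the sort of parser leniency that a survey of real-world files has to consider.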
