Making more sense of PDF structures in the wild at scale
Progress and outcomes of analysis on 8 million PDFs gathered from Common Crawl
About the presenter(s)
Tim has been working in content/metadata extraction (and evaluation), advanced search and relevance tuning for nearly 20 years. Tim is the founder of Rhapsode Consulting LLC, and he currently works …
Description
This is a follow-on talk from our 2021 PDF Days presentation on the File Observatory. Our team built the File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program by enabling parser developers to understand features of PDFs in the wild at scale. In the first part of our presentation, we’ll offer an overview of the observatory’s capabilities, from gathering files, to running numerous parsers on them, to searching and analyzing the features the parsers extract. In the second part, we’ll detail progress on building and packaging the “observatory in a box” for transition. In the third part, we’ll present findings from an analysis of roughly 8 million PDFs from Common Crawl, including parser warnings, exceptions and errors across the set of files, as well as statistical summaries of PDF features such as versions, languages, creator tools/producers and more interesting syntactic features.