
UNSAFE-DOCS: a new SafeDocs Corpus including the Voice of the Offense

The PDF, ICC, NITF and video datasets used in SafeDocs’ performer hackathons and evaluations are now available for download from digitalcorpora.org.
PDF Association staff

January 10, 2024




The UNSAFE-DOCS corpus follows the release of the curated CC-MAIN-2021-31-PDF-UNTRUNCATED corpus. It contains over 5.0 million files that were either collected by a team at NASA’s Jet Propulsion Laboratory (JPL) or synthetically generated by Kudu Dynamics’ “Voice of the Offense” (VoO) team for the Defense Advanced Research Projects Agency (DARPA)’s SafeDocs Program.

The common-crawl component of the UNSAFE-DOCS corpus (over 3.9 million PDF files collected by the JPL team) includes some truncated PDF files, as described in the paper “Building a Wide Reach Corpus for Secure Parser Development” by Allison et al. ([slides] [paper]). Within the SafeDocs program, this component served as the starting point for many generated files, as discussed below, and as the Program’s evaluation corpus. It was later expanded and improved to become CC-MAIN-2021-31-PDF-UNTRUNCATED.

The UNSAFE-DOCS datasets

The UNSAFE-DOCS corpus contains file and streaming formats used during the SafeDocs program research, Hackathon, and Evaluation events. The folders are within the set of zip files posted on digitalcorpora.org:

common-crawl

  • This collection of common-crawl-*.zip files contains more than 3.9 million PDFs from Common Crawl. Note that some PDFs were truncated by Common Crawl when fetched.
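Because some of these files were cut short when fetched, it can help to flag likely truncations before feeding the corpus to a parser. The following is a minimal sketch of one common heuristic, not the method used by the JPL team: a complete PDF normally carries an `%%EOF` marker close to its end, so a file whose last bytes lack that marker is a candidate for truncation. The function names, the 1024-byte window, and the assumption that extracted files carry a `.pdf` extension are all illustrative.

```python
import os
from pathlib import Path


def read_tail(path, n: int = 1024) -> bytes:
    """Return up to the last n bytes of a file without reading all of it."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - n))
        return f.read()


def missing_eof_marker(tail: bytes) -> bool:
    """True when the PDF end-of-file marker is absent from the given bytes."""
    return b"%%EOF" not in tail


def possibly_truncated(folder):
    """Yield paths of .pdf files under folder whose tail lacks %%EOF.

    Assumption: files were extracted locally and end in ".pdf".
    """
    for path in sorted(Path(folder).rglob("*.pdf")):
        if missing_eof_marker(read_tail(path)):
            yield path
```

The check is only a hint. A malformed but complete file may also lack the marker, and a truncated file with incremental updates may happen to end just after an earlier `%%EOF`.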

corpora-pdf

  • This collection of corpora-pdf-*.zip files contains over 1.5 million "unsafe" PDF files.
  • The corpus was designed to achieve the necessary diversity and complexity, providing a representative baseline for the assessment of both performer-created parsers and extant PDF parser security. The PDF files leverage three unique sources developed by the SafeDocs TA3 performer, Kudu Dynamics, for the creation of malformed PDF files that may exhibit potentially dangerous behavior in the anti-pattern known as a “shotgun parser”:
    • Synthetic and real-world PDF files that resulted in unique call paths through various extant parsers identified by Google’s AFL security-oriented fuzzer;
    • Synthetically modified real-world PDF files malformed by a directed PDF semantic- and grammar-aware tool; and
    • Synthetically malformed real-world PDF files with byte sequences randomly modified by a PDF syntax-aware tool.
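To give a feel for the third source, here is a deliberately simplified sketch of random byte mutation. It is not Kudu Dynamics' tooling: the tool described above is PDF syntax-aware, whereas this example is naive and overwrites bytes at random positions with no knowledge of PDF structure. The function name, the mutation count, and the seed parameter are assumptions made for illustration.

```python
import random


def mutate_bytes(data: bytes, n_mutations: int = 8, seed: int = 0) -> bytes:
    """Return a copy of data with up to n_mutations bytes overwritten.

    Naive and syntax-unaware: positions and replacement values are drawn
    uniformly at random, so the file length never changes.
    """
    rng = random.Random(seed)  # seeded so a given mutant can be reproduced
    buf = bytearray(data)
    if not buf:
        return bytes(buf)
    for _ in range(n_mutations):
        pos = rng.randrange(len(buf))
        buf[pos] = rng.randrange(256)
    return bytes(buf)
```

Seeding the generator keeps each malformed file reproducible, which matters when a mutant triggers a parser fault that needs to be investigated later.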

corpora-video

  • This collection of corpora-video-*.zip files contains corpora for several streaming video formats.

corpora-icc

  • corpora-icc.zip contains corpora for ICC color profiles.

corpora-nitf-01 and corpora-nitf-02

  • corpora-nitf-01.zip and corpora-nitf-02.zip contain corpora for the National Imagery Transmission Format (NITF).
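Once one of the zip files listed above has been downloaded, a quick inventory shows which folders it contains before anything is extracted. The sketch below uses Python's standard `zipfile` module to count files per folder. It assumes that entry names inside each archive begin with the folder paths described in this article, which has not been verified here; the function name and the `depth` parameter are illustrative.

```python
import zipfile
from collections import Counter


def count_by_folder(zip_source, depth: int = 2) -> Counter:
    """Count archive members grouped by their first `depth` folder levels.

    zip_source may be a path to a local zip file or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            folders = info.filename.split("/")[:-1]  # drop the file name
            key = "/".join(folders[:depth]) or "."   # "." means archive root
            counts[key] += 1
    return counts
```

For example, `count_by_folder("corpora-icc.zip")` would return a `Counter` keyed by folder prefix, assuming the archive has been saved under that name in the working directory.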

UNSAFE-DOCS corpora structure

The folder structure of the UNSAFE-DOCS corpus consists of the following:

  • common-crawl: This folder contains an initial extraction of 3,953,920 PDFs from Common Crawl. Note that some PDFs were truncated by Common Crawl when fetched.
  • corpora-pdf
    • corpora-pdf/hackathonOnePrep: Corpus used to prepare for the study of the PDF format and PDF parsers before SafeDocs Hackathon 1 (December, 2019).
    • corpora-pdf/hackathonOne: Corpus used for the study of the PDF format and PDF parsers during SafeDocs Hackathon 1 (December, 2019).
    • corpora-pdf/evalOne: Corpus used for the study of the PDF format and evaluation of safe PDF parsers during SafeDocs Evaluation 1 (March, 2020).
    • corpora-pdf/hackathonTwo: Corpus used for the study of the PDF format and PDF parsers during SafeDocs Hackathon 2 (July, 2020).
    • corpora-pdf/evalTwo: Corpus used for the study of the PDF format and evaluation of safe PDF parsers during SafeDocs Evaluation 2 (October, 2020).
    • corpora-pdf/hackathonThree: Corpus used for the study of the PDF format and PDF parsers during SafeDocs Hackathon 3 (March, 2021).
    • corpora-pdf/evalThree: Corpus used for the study of the PDF format and evaluation of safe PDF parsers during SafeDocs Evaluation 3 (June, 2021).
    • corpora-pdf/evalFour: Corpus used for the study of the PDF format and evaluation of safe PDF parsers during SafeDocs Evaluation 4 (March, 2022).
    • corpora-pdf/testPDFs: Corpus used for simple PDF testing of PDF parsers.
    • corpora-pdf/TA3: Corpus synthetically generated with various stressful malforms to study the PDF format and PDF parsers.
    • corpora-pdf/issue_trackers: Corpus of PDF files retrieved from issue trackers for various PDF parsers. This data has been formalized into the previously-announced Stressful “Issue Tracker” corpus.
    • corpora-pdf/common_crawl_cmin: Corpus of files from the common_crawl corpus that was minimized via afl-cmin and the poppler:pdftoppm binary.
    • corpora-pdf/govdocs_cmin: Corpus of files from the Govdocs1 corpus that was minimized via afl-cmin and the poppler:pdftoppm binary.
  • corpora-icc
    • corpora-icc/hackathonFour: Corpus used to study the ICC format and ICC parsers during SafeDocs Hackathon 4 (Nov, 2021).
    • corpora-icc/common_crawl: Corpus used to study the ICC format.
    • corpora-icc/common_crawl_iccs: Corpus used to study the ICC format.
    • corpora-icc/TA3: Corpus synthetically generated with various stressful malforms to study the ICC format.
  • corpora-nitf
    • corpora-nitf/hackathonFive: Corpus used to study the NITF format and NITF parsers during SafeDocs Hackathon 5 (Nov, 2022).
    • corpora-nitf/nitf: Corpus used to study the NITF format and NITF parsers.
    • corpora-nitf/nitf-bestof: Final collection of the best NITF corpus with stressful malforms.
  • corpora-video
    • corpora-video/hackathonThree: Corpus used for the study of the MPEG format and evaluation of safe MPEG parsers during SafeDocs Hackathon 3.
    • corpora-video/evalFour: Corpus used for the study of the MPEG format and evaluation of safe MPEG parsers during SafeDocs Evaluation 4.
    • corpora-video/evalFourInfo: Information on the corpus used for the study of the MPEG format and evaluation of safe MPEG parsers during SafeDocs Evaluation 4.
    • corpora-video/evalSDF: Packet captures of a streaming file transfer used for the evaluation of safe parsers during SafeDocs Evaluation 4.
    • corpora-video/TA4: Corpus used for the study of the MPEG format and evaluation of safe MPEG parsers.
    • corpora-video/common_crawl_mpegs: Corpus used for the study of the MPEG format and evaluation of safe MPEG parsers.
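Two of the corpora-pdf folders listed above were minimized with afl-cmin, which keeps a smaller set of input files that still exercises the same instrumented behavior in the target program. The toy sketch below illustrates that idea as a greedy set cover. It is not afl-cmin's actual algorithm, and the coverage data is hypothetical: in practice the coverage sets would come from running an instrumented binary on each file.

```python
def minimize_corpus(coverage: dict[str, set[int]]) -> list[str]:
    """Greedily pick files until every observed coverage ID is covered.

    coverage maps a file name to the set of coverage IDs it triggers
    (hypothetical data; real IDs would come from instrumentation).
    """
    remaining = set().union(*coverage.values())
    kept = []
    while remaining:
        # Choose the file covering the most still-uncovered IDs;
        # sorting the names first makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda name: len(coverage[name] & remaining))
        kept.append(best)
        remaining -= coverage[best]
    return kept


example = {
    "a.pdf": {1, 2, 3},
    "b.pdf": {2, 3},
    "c.pdf": {4},
}
print(minimize_corpus(example))  # ['a.pdf', 'c.pdf']
```

Here `b.pdf` is dropped because everything it triggers is already triggered by `a.pdf`, which is the kind of redundancy a minimized corpus removes.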

Credits

This dataset was gathered by a team at NASA’s Jet Propulsion Laboratory (JPL), California Institute of Technology while supporting the Defense Advanced Research Projects Agency (DARPA)’s SafeDocs Program. The JPL team included Chris Mattmann (PI), Wayne Burke, Dustin Graf, Tim Allison, Ryan Stonebraker, Mike Milano, Philip Southam, and Anastasia Menshikova.

The JPL and Kudu Dynamics teams collaborated with Peter Wyatt, the Chief Technology Officer of the PDF Association and PI on the SafeDocs program, in the design and documentation of this corpus. The JPL team and PDF Association would like to thank Simson Garfinkel and the Digital Corpora Project for taking ownership of this dataset and publishing it. Our thanks are extended to the Amazon Open Data Sponsorship Program for enabling this large corpus to be free and publicly available as part of the Digital Corpora Project initiative.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. Government sponsorship acknowledged.


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.
