Stressful PDF Corpus
The "Issue Tracker" corpus of stressful PDF files was originally developed under the DARPA-funded "SafeDocs" program as discussed on pdfa.org.
If a “stressful PDF” is any file that causes problems for a parser, then studying the problems that diverse parsers encounter can be a great learning experience.
This corpus now includes bug attachment data from 35 issue tracker repositories across 32 PDF technologies, comprising 31 GB and over 32,500 stressful PDF files. It was collated during November 2020 as part of the DARPA-funded "SafeDocs" project.
These issue trackers span a broad variety of PDF technologies written in a wide range of programming languages. Because of the corpus's size, we have packaged it into six compressed tarballs (.tgz files), each containing the data from multiple repositories, to make downloading more convenient. These .tgz files are available from https://labs.pdfa.org/stressful-corpus/.
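Because each tarball bundles several repositories, it can help to inventory a downloaded archive before unpacking it. The following is a minimal sketch, not part of the corpus tooling: the function name `count_files_by_folder` is hypothetical, and it assumes the top-level directories inside each .tgz correspond to the Folder names in the table below, which the source does not confirm.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_files_by_folder(archive_path):
    """Count regular files per top-level folder in a .tgz, without extracting it.

    Assumption: each repository's files sit under one top-level directory.
    Files stored at the archive root are counted under "(top level)".
    """
    counts = Counter()
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            # Skip directories, symlinks, and other non-regular entries.
            if not member.isfile():
                continue
            # PurePosixPath drops any leading "./" from the member name.
            parts = PurePosixPath(member.name).parts
            if len(parts) > 1:
                counts[parts[0]] += 1
            else:
                counts["(top level)"] += 1
    return counts
```

Calling `count_files_by_folder("some-download.tgz")` (a placeholder file name) returns a `Counter` mapping folder names to file counts. Reading the member list rather than extracting avoids writing untrusted, possibly malformed files to disk just to see what an archive contains.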
PDF technology | Folder | Issue Tracker URL | # files | Size | .tgz file |
---|---|---|---|---|---|
Android PDF Viewer (Java) | androidpdfviewer | https://github.com/barteksc/AndroidPdfViewer | 13 | 3.2M | 5 |
Cairo | cairo | https://bugs.freedesktop.org | 166 | 33M | 5 |
Cairo | cairo-gitlab | https://gitlab.freedesktop.org/cairo/cairo | 29 | 12M | 6 |
DeJaVu | dejavu | https://bugs.freedesktop.org | 39 | 2.7M | 5 |
eSignature DSS | DSS | https://ec.europa.eu/cefdigital/tracker/projects/DSS | 243 | 89M | 5 |
GNOME Evince | evince | https://gitlab.gnome.org/GNOME/evince | 241 | 591M | 6 |
Apache FOP | FOP | https://issues.apache.org/jira/projects/FOP | 808 | 157M | 5 |
Ghostscript (C/C++) | GHOSTSCRIPT | https://bugs.ghostscript.com/ | 5,458 | 5.6G | 2 |
Snappy PDF (laravel, PHP) | laravel-snappy | https://github.com/barryvdh/laravel-snappy | 5 | 1.8M | 5 |
LibreOffice | LIBRE_OFFICE | https://bugs.documentfoundation.org/ | 5,572 | 1.4G | 4 |
libvips image library | libvips | https://github.com/libvips/libvips | 18 | 384M | 5 |
Mozilla | MOZILLA | https://bugzilla.mozilla.org/ | 6,879 | 3.9G | 3 |
Apache Nutch | NUTCH | https://issues.apache.org/jira/projects/NUTCH | 13 | 976K | 5 |
OCRmyPDF (Python) | ocrmypdf | https://github.com/jbarlow83/OCRmyPDF | 205 | 501M | 5 |
Apache OpenOffice.org | OOO | https://bz.apache.org/ooo | 1,564 | 253M | 4 |
OpenPDF (Java) | openpdf | https://github.com/LibrePDF/OpenPDF | 32 | 3.2M | 5 |
parsr (JS) | parsr | https://github.com/axa-group/Parsr | 28 | 12M | 5 |
Mozilla pdf.js (JS) | pdf.js | https://github.com/mozilla/pdf.js | 2,368 | 4.5G | 4 |
Apache PDFBOX (Java) | PDFBOX | https://issues.apache.org/jira/projects/PDFBOX | 3,832 | 2.7G | 1 |
pdfcpu (Go) | pdfcpu | https://github.com/pdfcpu/pdfcpu | 100 | 218M | 5 |
Chromium PDFium (C++) | PDFIUM | https://bugs.chromium.org/p/pdfium/issues/list | 379 | 212M | 5 |
pdfkit (JS) | pdfkit | https://github.com/foliojs/pdfkit | 38 | 35M | 5 |
pdfminer.six (Python) | pdfminer.six | https://github.com/pdfminer/pdfminer.six | 123 | 106M | 5 |
PikePDF (Python) | pikepdf | https://github.com/pikepdf/pikepdf | 23 | 30M | 5 |
Apache POI | POI | https://bz.apache.org/bugzilla/ | 11 | 940K | 5 |
Poppler (C/C++) | poppler | https://bugs.freedesktop.org | 1,585 | 6.2G | 5 |
Poppler (C/C++) | poppler-gitlab | https://gitlab.freedesktop.org/poppler/poppler | 463 | 926M | 6 |
Prawn PDF (Ruby) | prawn | https://github.com/prawnpdf/prawn | 53 | 69M | 5 |
qpdf (C++) | qpdf | https://github.com/qpdf/qpdf | 111 | 324M | 5 |
react-pdf (JS) | react-pdf | https://github.com/diegomura/react-pdf | 14 | 2.2M | 5 |
Red Hat Linux | REDHAT | https://bugzilla.redhat.com/ | 1,712 | 1.3G | 5 |
Sumatra PDF (C/C++) | sumatrapdf | https://github.com/sumatrapdfreader/sumatrapdf | 320 | 788M | 5 |
tabula | tabula | https://github.com/tabulapdf/tabula | 2 | 172K | 5 |
tabula-java (Java) | tabula-java | https://github.com/tabulapdf/tabula-java | 77 | 45M | 5 |
Apache TIKA (Java) | TIKA | https://issues.apache.org/jira/projects/TIKA | 155 | 156M | 2 |
TOTAL: | 35 | - | 32,679 | 31G | - |
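To see how the files are spread across the six downloads, the table's counts can be tallied per .tgz file. The sketch below copies the Folder, # files, and .tgz file columns from the table above; the names `ROWS` and `files_per_tarball` are illustrative only.

```python
from collections import defaultdict

# (folder, number of files, .tgz file number), copied from the table above.
ROWS = [
    ("androidpdfviewer", 13, 5), ("cairo", 166, 5), ("cairo-gitlab", 29, 6),
    ("dejavu", 39, 5), ("DSS", 243, 5), ("evince", 241, 6),
    ("FOP", 808, 5), ("GHOSTSCRIPT", 5458, 2), ("laravel-snappy", 5, 5),
    ("LIBRE_OFFICE", 5572, 4), ("libvips", 18, 5), ("MOZILLA", 6879, 3),
    ("NUTCH", 13, 5), ("ocrmypdf", 205, 5), ("OOO", 1564, 4),
    ("openpdf", 32, 5), ("parsr", 28, 5), ("pdf.js", 2368, 4),
    ("PDFBOX", 3832, 1), ("pdfcpu", 100, 5), ("PDFIUM", 379, 5),
    ("pdfkit", 38, 5), ("pdfminer.six", 123, 5), ("pikepdf", 23, 5),
    ("POI", 11, 5), ("poppler", 1585, 5), ("poppler-gitlab", 463, 6),
    ("prawn", 53, 5), ("qpdf", 111, 5), ("react-pdf", 14, 5),
    ("REDHAT", 1712, 5), ("sumatrapdf", 320, 5), ("tabula", 2, 5),
    ("tabula-java", 77, 5), ("TIKA", 155, 2),
]


def files_per_tarball(rows):
    """Sum the file counts of all folders packaged into each .tgz file."""
    totals = defaultdict(int)
    for _folder, n_files, tgz in rows:
        totals[tgz] += n_files
    return dict(sorted(totals.items()))


totals = files_per_tarball(ROWS)
for tgz, n_files in totals.items():
    print(f".tgz {tgz}: {n_files:,} files")
print(f"total: {sum(totals.values()):,} files from {len(ROWS)} folders")
```

The per-tarball counts add up to the 32,679 files and 35 folders shown in the TOTAL row.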
This BACKGROUND.txt file describes the overall issue tracker corpus and how the data was collated.
Please provide your feedback or comments in the PDF TWG.
The PDF Association again wishes to thank the NASA JPL and Apache Tika teams, and particularly Dr. Tim Allison, for their efforts in preparing the technology and collating the data. We also wish to thank Maruan Sahyoun of PDF Association member FileAffairs GmbH, part of the Apache PDFBox team, for previously hosting the “Issue Tracker” PDF corpus.
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.