Stressful PDF corpus grows!
In early September I announced the initial release of the “Issue Tracker” stressful corpus of PDF files that was developed under the DARPA-funded “SafeDocs” program. This has generated a lot of interest in the PDF technical community, and we have received some great feedback from many early adopters.
Now version 2 of the “Issue Tracker” corpus of stressful PDF has landed just 2 months later!
Some of these files are the devil’s work https://t.co/3BgDHw8qiu
— Victor P. (@PDFVic) November 9, 2020
Up from the original 11 issue tracker repositories, the corpus now includes bug attachment data from 35 issue tracker repositories across 32 PDF technologies, comprising 31GB and over 32,500 stressful PDF files.
This larger set of issue trackers spans a broader variety of PDF technologies written in a wider range of programming languages and, as one might expect, follows some technologies across more than one issue tracker. Because of this increase in size, we have re-packaged the corpus into six compressed tarballs (.tgz files), each containing the data from multiple repositories, to make downloading more convenient.
PDF technology | Folder | Issue Tracker URL | # files | Size | .tgz file |
---|---|---|---|---|---|
Android PDF Viewer (Java) | androidpdfviewer | https://github.com/barteksc/AndroidPdfViewer | 13 | 3.2M | 5 |
Cairo | cairo | https://bugs.freedesktop.org | 166 | 33M | 5 |
Cairo | cairo-gitlab | https://gitlab.freedesktop.org/cairo/cairo | 29 | 12M | 6 |
DeJaVu | dejavu | https://bugs.freedesktop.org | 39 | 2.7M | 5 |
eSignature DSS | DSS | https://ec.europa.eu/cefdigital/tracker/projects/DSS | 243 | 89M | 5 |
GNOME Evince | evince | https://gitlab.gnome.org/GNOME/evince | 241 | 591M | 6 |
Apache FOP | FOP | https://issues.apache.org/jira/projects/FOP | 808 | 157M | 5 |
GhostScript (C/C++) | GHOSTSCRIPT | https://bugs.ghostscript.com/ | 5,458 | 5.6G | 2 |
Snappy PDF (laravel, PHP) | laravel-snappy | https://github.com/barryvdh/laravel-snappy | 5 | 1.8M | 5 |
Libre Office | LIBRE_OFFICE | https://bugs.documentfoundation.org/ | 5,572 | 1.4G | 4 |
libvips image library | libvips | https://github.com/libvips/libvips | 18 | 384M | 5 |
Mozilla | MOZILLA | https://bugzilla.mozilla.org/ | 6,879 | 3.9G | 3 |
Apache Nutch | NUTCH | https://issues.apache.org/jira/projects/NUTCH | 13 | 976K | 5 |
OCRmyPDF (Python) | ocrmypdf | https://github.com/jbarlow83/OCRmyPDF | 205 | 501M | 5 |
Apache OpenOffice.org | OOO | https://bz.apache.org/ooo | 1,564 | 253M | 4 |
OpenPDF (Java) | openpdf | https://github.com/LibrePDF/OpenPDF | 32 | 3.2M | 5 |
parsr (JS) | parsr | https://github.com/axa-group/Parsr | 28 | 12M | 5 |
Mozilla pdf.js (JS) | pdf.js | https://github.com/mozilla/pdf.js | 2,368 | 4.5G | 4 |
Apache PDFBOX (Java) | PDFBOX | https://issues.apache.org/jira/projects/PDFBOX | 3,832 | 2.7G | 1 |
pdfcpu (Go) | pdfcpu | https://github.com/pdfcpu/pdfcpu | 100 | 218M | 5 |
Chromium PDFium (C++) | PDFIUM | https://bugs.chromium.org/p/pdfium/issues/list | 379 | 212M | 5 |
pdfkit (JS) | pdfkit | https://github.com/foliojs/pdfkit | 38 | 35M | 5 |
pdfminer.six (Python) | pdfminer.six | https://github.com/pdfminer/pdfminer.six | 123 | 106M | 5 |
PikePDF (Python) | pikepdf | https://github.com/pikepdf/pikepdf | 23 | 30M | 5 |
Apache POI | POI | https://bz.apache.org/bugzilla/ | 11 | 940K | 5 |
Poppler (C/C++) | poppler | https://bugs.freedesktop.org | 1,585 | 6.2G | 5 |
Poppler (C/C++) | poppler-gitlab | https://gitlab.freedesktop.org/poppler/poppler | 463 | 926M | 6 |
Prawn PDF (Ruby) | prawn | https://github.com/prawnpdf/prawn | 53 | 69M | 5 |
qpdf (C++) | qpdf | https://github.com/qpdf/qpdf | 111 | 324M | 5 |
react-pdf (JS) | react-pdf | https://github.com/diegomura/react-pdf | 14 | 2.2M | 5 |
Redhat Linux | REDHAT | https://bugzilla.redhat.com/ | 1,712 | 1.3G | 5 |
Sumatra PDF (C/C++) | sumatrapdf | https://github.com/sumatrapdfreader/sumatrapdf | 320 | 788M | 5 |
tabula | tabula | https://github.com/tabulapdf/tabula | 2 | 172K | 5 |
tabula-java (Java) | tabula-java | https://github.com/tabulapdf/tabula-java | 77 | 45M | 5 |
Apache TIKA (Java) | TIKA | https://issues.apache.org/jira/projects/TIKA | 155 | 156M | 2 |
TOTAL: | 35 | - | 32,679 | 31G | - |
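To compare a downloaded tarball against the per-folder file counts in the table, the archive can be tallied without extracting anything to disk. The following is only a minimal sketch: it assumes that each .tgz stores its files under top-level folders named as in the Folder column, which this article does not confirm, and the function name `count_members_per_folder` is purely illustrative.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_per_folder(tgz_path):
    """Tally regular files per top-level folder in a .tgz, without extracting.

    Assumption: members are stored as <folder>/<...>/<file>. Files that sit
    directly at the top level of the archive are ignored.
    """
    counts = Counter()
    with tarfile.open(tgz_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            # PurePosixPath drops any leading "./" component for us.
            parts = PurePosixPath(member.name).parts
            if len(parts) >= 2:
                counts[parts[0]] += 1
    return counts
```

Calling `count_members_per_folder` on each downloaded .tgz and summing the resulting counters gives a quick sanity check against the table. Reading only the archive index also avoids writing any of the files to disk before you are ready to handle them.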
The fundamental value of the corpus comes from the ability to leverage and learn from the issues discovered by others. As a simple example, a quick test of pdfcpu (the Go-based PDF parser) resulted in 12 files timing out while being processed. An issue has been raised and has already been fixed.
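A timeout test of this kind can be sketched in a few lines. This is not the harness used for the pdfcpu test mentioned above, only an illustration under stated assumptions: the tool under test accepts a file path as its last argument, corpus files carry a lowercase `.pdf` extension, and the 30-second limit is arbitrary. The name `find_timeouts` is hypothetical.

```python
import subprocess
from pathlib import Path


def find_timeouts(command, corpus_root, timeout_seconds=30.0):
    """Run `command <file>` on every .pdf under corpus_root.

    Returns the list of files for which the command did not finish within
    timeout_seconds. Exit codes are deliberately ignored: this sketch only
    looks for hangs, not for crashes or reported errors.
    """
    timed_out = []
    for pdf in sorted(Path(corpus_root).rglob("*.pdf")):
        try:
            subprocess.run(
                [*command, str(pdf)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_seconds,  # the child is killed on expiry
                check=False,
            )
        except subprocess.TimeoutExpired:
            timed_out.append(pdf)
    return timed_out
```

For example, `find_timeouts(["my-pdf-tool", "--check"], "corpus/")` would return the files on which the placeholder command `my-pdf-tool --check` hangs. Given the anti-malware caveat that applies to this corpus, a run like this is best done inside a sandbox or disposable virtual machine.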
Version 2 of the “Issue Tracker” corpus also re-crawled all the issue trackers that were in the initial release. As a result, PDF file attachments removed by their maintainers since version 1 will no longer be in the version 2 corpus downloads. The initial release of the “Issue Tracker” corpus will remain available for download at https://corpora.tika.apache.org/base/packaged/pdfs/archive/.
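If you hold extracted copies of both releases, the attachments that disappeared between them can be listed with a simple set difference. The sketch below assumes that both versions use the same folder layout and file names, which is not stated here and may not hold in practice. The function names are illustrative only.

```python
from pathlib import Path


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    base = Path(root)
    return {p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()}


def removed_since(v1_root, v2_root):
    """List files present in the v1 tree but absent from the v2 tree."""
    return sorted(relative_files(v1_root) - relative_files(v2_root))
```

A file that was merely renamed or moved between releases would show up as removed, so treat the result as a starting point for inspection, not as a definitive list.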
This README file describes the overall issue tracker corpus and how data has been collated.
This README file describes the PDF-centric issue tracker corpus that is pre-packaged into six compressed tarballs (.tgz files) (see https://corpora.tika.apache.org/base/packaged/pdfs/). The broader multi-format Issue Tracker corpus, which includes many more formats than just PDF and is used for testing Apache Tika, is at https://corpora.tika.apache.org/base/docs/; however, these files are not pre-packaged.
As the corpus is a direct and unfiltered collation of attachment data from numerous public issue trackers, it does contain a few PDF files that will trigger anti-virus/anti-malware software on various platforms. As with anything downloaded from the internet, you should always follow good cyber-hygiene practices when handling this data.
For more information and to stay up-to-date with the “Issue Tracker” PDF corpus, please join the corpora-dev@tika.apache.org email list (via https://tika.apache.org/mail-lists.html) and, for PDF Association members, please provide your feedback or comments in the PDF TWG.
The PDF Association again wishes to thank the NASA JPL and Apache Tika teams, and particularly Dr. Tim Allison, for their efforts in maintaining the technology and collating the data. We also wish to thank Maruan Sahyoun of PDF Association member FileAffairs GmbH, part of the Apache PDFBox team, for hosting the “Issue Tracker” PDF corpus as a valuable new industry resource.
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.