Stressful PDF Corpus

Bell curve of PDF complexity. The “Issue Tracker” corpus of stressful PDF files was originally developed under the DARPA-funded “SafeDocs” program, as discussed on pdfa.org.

If a “stressful PDF” can be considered as any file that causes problems for a parser, then looking into the problems faced by diverse parsers can be a great learning experience.
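As a rough illustration of that idea, the sketch below runs a parser over every file in a directory and records which ones raise an error. The names `triage` and `naive_header_check` are hypothetical, and the header check (with its 1024-byte window) is only a toy stand-in chosen for this sketch; in practice you would pass in a call to whichever PDF library you are testing.

```python
from pathlib import Path


def naive_header_check(path):
    """Toy stand-in for a real parser: require a %PDF- marker in the first 1024 bytes."""
    with open(path, "rb") as handle:
        if b"%PDF-" not in handle.read(1024):
            raise ValueError("no %PDF- header found")


def triage(root, parse=naive_header_check):
    """Run parse() on every file under root.

    Returns a dict mapping each failing path to the name of the exception type.
    """
    failures = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            parse(path)
        except Exception as exc:  # broad on purpose: any failure is of interest
            failures[path] = type(exc).__name__
    return failures
```

Grouping the returned exception names (for example with `collections.Counter`) gives a quick view of which kinds of failure a given parser hits most often. Note that a hang or a crash of the interpreter itself would not be caught by this in-process approach.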

This corpus now includes bug attachment data from 35 issue tracker repositories across 32 PDF technologies, comprising 31 GB and over 32,500 stressful PDF files.

These issue trackers now span a broad variety of PDF technologies written in a wide range of programming languages. Because of the corpus size, we have packaged it into six compressed tarballs (.tgz files), each containing the data from multiple repositories, to make downloading more convenient.

PDF technology	Folder	Issue Tracker URL	# files	Size	.tgz file
Android PDF Viewer (Java)	androidpdfviewer	https://github.com/barteksc/AndroidPdfViewer	13	3.2M	5
Cairo	cairo	https://bugs.freedesktop.org	166	33M	5
Cairo	cairo-gitlab	https://gitlab.freedesktop.org/cairo/cairo	29	12M	6
DeJaVu	dejavu	https://bugs.freedesktop.org	39	2.7M	5
eSignature DSS	DSS	https://ec.europa.eu/cefdigital/tracker/projects/DSS	243	89M	5
GNOME Evince	evince	https://gitlab.gnome.org/GNOME/evince	241	591M	6
Apache FOP	FOP	https://issues.apache.org/jira/projects/FOP	808	157M	5
GhostScript (C/C++)	GHOSTSCRIPT	https://bugs.ghostscript.com/	5,458	5.6G	2
Snappy PDF (laravel, PHP)	laravel-snappy	https://github.com/barryvdh/laravel-snappy	5	1.8M	5
LibreOffice	LIBRE_OFFICE	https://bugs.documentfoundation.org/	5,572	1.4G	4
libvips image library	libvips	https://github.com/libvips/libvips	18	384M	5
Mozilla	MOZILLA	https://bugzilla.mozilla.org/	6,879	3.9G	3
Apache Nutch	NUTCH	https://issues.apache.org/jira/projects/NUTCH	13	976K	5
OCRmyPDF (Python)	ocrmypdf	https://github.com/jbarlow83/OCRmyPDF	205	501M	5
Apache OpenOffice.org	OOO	https://bz.apache.org/ooo	1,564	253M	4
OpenPDF (Java)	openpdf	https://github.com/LibrePDF/OpenPDF	32	3.2M	5
parsr (JS)	parsr	https://github.com/axa-group/Parsr	28	12M	5
Mozilla pdf.js (JS)	pdf.js	https://github.com/mozilla/pdf.js	2,368	4.5G	4
Apache PDFBOX (Java)	PDFBOX	https://issues.apache.org/jira/projects/PDFBOX	3,832	2.7G	1
pdfcpu (Go)	pdfcpu	https://github.com/pdfcpu/pdfcpu	100	218M	5
Chromium PDFium (C++)	PDFIUM	https://bugs.chromium.org/p/pdfium/issues/list	379	212M	5
pdfkit (JS)	pdfkit	https://github.com/foliojs/pdfkit	38	35M	5
pdfminer.six (Python)	pdfminer.six	https://github.com/pdfminer/pdfminer.six	123	106M	5
PikePDF (Python)	pikepdf	https://github.com/pikepdf/pikepdf	23	30M	5
Apache POI	POI	https://bz.apache.org/bugzilla/	11	940K	5
Poppler (C/C++)	poppler	https://bugs.freedesktop.org	1,585	6.2G	5
Poppler (C/C++)	poppler-gitlab	https://gitlab.freedesktop.org/poppler/poppler	463	926M	6
Prawn PDF (Ruby)	prawn	https://github.com/prawnpdf/prawn	53	69M	5
qpdf (C++)	qpdf	https://github.com/qpdf/qpdf	111	324M	5
react-pdf (JS)	react-pdf	https://github.com/diegomura/react-pdf	14	2.2M	5
Red Hat Linux	REDHAT	https://bugzilla.redhat.com/	1,712	1.3G	5
Sumatra PDF (C/C++)	sumatrapdf	https://github.com/sumatrapdfreader/sumatrapdf	320	788M	5
tabula	tabula	https://github.com/tabulapdf/tabula	2	172K	5
tabula-java (Java)	tabula-java	https://github.com/tabulapdf/tabula-java	77	45M	5
Apache TIKA (Java)	TIKA	https://issues.apache.org/jira/projects/TIKA	155	156M	2
TOTAL:	35	–	32,679	31G	–
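
To compare a downloaded archive against the file counts and sizes in the table above, a minimal sketch along the following lines may help. It assumes that each entry in the Folder column appears as a top-level directory inside the archive, which is not confirmed here; the function name `summarise_archive` is hypothetical. The archive is only read, not extracted.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def summarise_archive(archive_path):
    """Return {top-level folder: (file count, total bytes)} for a .tgz archive."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            parts = [p for p in PurePosixPath(member.name).parts if p not in (".", "/")]
            # Files with no parent directory are grouped under a placeholder key.
            folder = parts[0] if len(parts) > 1 else "(top level)"
            counts[folder] += 1
            sizes[folder] += member.size
    return {folder: (counts[folder], sizes[folder]) for folder in sorted(counts)}
```

If the archives turn out to wrap everything in a single outer directory, the grouping key would need to be `parts[1]` instead. Sizes reported here are uncompressed byte counts, so they will only roughly match the rounded values in the Size column.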

This README file describes the overall issue tracker corpus and how data has been collated.

This README file describes the PDF-centric issue tracker corpus, which is pre-packaged into six compressed tarballs (.tgz files); see https://corpora.tika.apache.org/base/packaged/pdfs/. The broader multi-format Issue Tracker corpus, which covers many more formats than PDF and is used for testing Apache Tika, is at https://corpora.tika.apache.org/base/docs/, but those files are not pre-packaged.
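
Because the archives run to several gigabytes each, it is worth streaming a download to disk rather than holding it in memory. The following is a minimal sketch using only the Python standard library; the function name `download` is hypothetical, and the archive file names are not listed here, so they would have to be read from the directory listing at the packaged URL.

```python
import shutil
import urllib.request
from pathlib import Path

# Location of the pre-packaged PDF archives, as given in the text above.
BASE_URL = "https://corpora.tika.apache.org/base/packaged/pdfs/"


def download(url, dest_dir):
    """Stream the resource at url into dest_dir and return the local path."""
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    dest = Path(dest_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return dest
```

A call would look like `download(BASE_URL + archive_name, "corpus")`, where `archive_name` is one of the .tgz names taken from the listing. The sketch does no retrying or resuming, which a real multi-gigabyte transfer would probably want.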

For more information and to stay up to date with the “Issue Tracker” PDF corpus, please join the corpora-dev@tika.apache.org email list (via https://tika.apache.org/mail-lists.html). PDF Association members are also invited to provide feedback or comments in the PDF TWG.

The PDF Association again wishes to thank the NASA JPL and Apache Tika teams, and particularly Dr. Tim Allison, for their efforts in maintaining the technology and collating the data. We also wish to thank Maruan Sahyoun of PDF Association member FileAffairs GmbH, part of the Apache PDFBox team, for hosting the “Issue Tracker” PDF corpus as a valuable new industry resource.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.
