AI researchers leverage SafeDocs PDF corpora

About PDF Association staff

HuggingFace.AI examines PDF. New PDF support in Android. Google reinvents a 31-year-old PDF feature. Firefox wants you to know that you can edit PDFs in its browser. It’s PDF in the Wild for April, 2024!

PDF in the Wild | April 25, 2024

An example page of one pdf document, with added bounding boxes around words (red), lines (blue) and embedded images (green).

AI researchers leverage SafeDocs PDF corpora

Hugging Face, a major collaboration platform for the machine learning community, has adopted the UNTRUNCATED SafeDocs PDF corpus for use as a pretraining basis for vision-language models.

Under the title “Dataset Card for PDF Association dataset (PDFA)”, HuggingFace.AI says:

“Showcasing a curated collection of 8M PDF files, this dataset is celebrated for its diversity—featuring advertisements, brochures, manuals, pamphlets, and more, all annotated with text and image bounding boxes. It is very much a research experiment: can we train from pdfs in the wild, recover annotations just from pdf parse, and so on?”

The original CC-MAIN-2021-31-PDF-UNTRUNCATED corpus is intended to represent the diversity of extant files using the PDF format, while Hugging Face focuses on the content in these PDFs to make “... the dataset machine learning-ready for vision-language models”.

The Hugging Face dataset adds JSON data, including text and image bounding boxes. As a result of their processing pipeline, they also exclude very large PDFs (>100 MB), PDFs that are very slow to render, and those that could not be processed using their specific PDF tools. They also acknowledge that “the way we obtain an approximate reading order is simply by looking at the frequency peaks of the leftmost word x-coordinate”. The developers specifically state that common issues such as mojibake (garbled text due to decoding errors) are not addressed and may be present. Read more about this release!
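To make the reading-order heuristic more concrete, here is a minimal Python sketch of one way such a "frequency peak" approach could work. It is an illustration only, not Hugging Face's actual pipeline code: the function names, the `(x_left, y_top, text)` tuple layout, the bin width and the peak threshold are all assumptions made for this example.

```python
from collections import Counter

# Each text line is (x_left, y_top, text): x_left is the x-coordinate of the
# line's leftmost word and y_top grows downward. This layout is an assumption
# for the sketch, not the format used in the dataset's JSON.


def column_starts(lines, bin_width=10.0, min_share=0.3):
    """Find x positions where many lines begin (the frequency peaks)."""
    bins = Counter(int(x // bin_width) for x, _, _ in lines)
    threshold = min_share * len(lines)
    # Keep only bins that hold a large enough share of all line starts.
    return sorted(b * bin_width for b, n in bins.items() if n >= threshold)


def reading_order(lines, bin_width=10.0, min_share=0.3):
    """Order lines column by column, top to bottom within each column."""
    starts = column_starts(lines, bin_width, min_share)

    def column_of(x):
        # Right-most detected column start at or left of x; 0 if none.
        eligible = [i for i, s in enumerate(starts) if s <= x]
        return eligible[-1] if eligible else 0

    return sorted(lines, key=lambda ln: (column_of(ln[0]), ln[1], ln[0]))


# A hypothetical two-column page, deliberately listed out of order.
page = [
    (310.0, 100.0, "B1"),
    (50.0, 100.0, "A1"),
    (52.0, 120.0, "A2"),
    (312.0, 120.0, "B2"),
    (70.0, 140.0, "A3 (indented)"),
]

for _, _, text in reading_order(page):
    print(text)
```

In this toy page the left-column lines (A1, A2, A3) are emitted before the right-column lines (B1, B2), and the indented line is still assigned to the left column because its start is not frequent enough to count as a peak. The same example hints at why the result is only approximate: tables, sidebars, rotated text or unusual indentation can easily produce misleading peaks.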

In discussions with HuggingFace.AI, the PDF Association also made them aware of assumptions and biases caused by their tooling, which they acknowledge in the section “Disclaimer and note to researchers”. We hope AI practitioners pay heed to these warnings!

Android 15 promises new PDF support

The Android 15 Developer Preview 2 includes improvements to its PdfRenderer APIs, allowing apps to incorporate support for longstanding PDF features such as password-protected files, annotations, form editing, searching, text selection, and linearized PDF.

Google reinvents PDF features

31 years ago, at its debut in 1993, PDF’s “outlines” feature gave authors the ability to add a table of contents that appeared to the side of the page, providing rapid access to any predefined point in the document.

In 2024 the new Google Scholar PDF Reader reinvents this feature for PDF documents.

Google's screenshot showing their "outlines".

Firefox promotes in-browser PDF editing

The recent Firefox 124.0 release promotes in-browser PDF editing and enables caret browsing for PDFs, which supports text selection and accessibility, even though much of this pdf.js functionality has been around for a while.

Edit PDFs in Firefox promo image.

PDF COS Syntax VSCode extension accelerates past 2k downloads

Interest in the PDF Association’s utility for PDF developers continues to grow.

PDFacademicBot for April, 2024

Binsawad, M. (March 2024) ‘Enhancing PDF Malware Detection through Logistic Model Trees’, Computers, Materials & Continua, 78(3), pp. 3645–3663. https://doi.org/10.32604/cmc.2024.048183.

Butler, B. (2024) ‘Battle of the Books: Evaluating the Reading Comprehension of HTML and PDF Textbook Formats’, NISOD, 8 February. https://www.nisod.org/2024/02/08/xlvi_2/.

“Though there was not a statistically significant difference between the PDF group and the HTML group, there was a clear trend of higher performance across the PDF group overall. The results from this study can enable instructors to make informed decisions about the text options and other reading materials, such as research studies, provided to students.”

Dethier, E. et al. (2024) ‘Making Order in Household Accounting - Digital Invoices as Domestic Work Artifacts’, Computer Supported Cooperative Work (CSCW) [Preprint]. https://doi.org/10.1007/s10606-024-09495-w.

Fan, V. et al. (April 2024) ‘OpenChemIE: An Information Extraction Toolkit For Chemistry Literature’, Journal of Chemical Information and Modeling [Preprint]. https://arxiv.org/abs/2404.01462v1.

Gupta, V. and P, S.R. (April 2024) ‘Leveraging open-source models for legal language modeling and analysis: a case study on the Indian constitution’. http://arxiv.org/abs/2404.06751.

Kavčič Čolić, A. and Hari, A. (2024) ‘Improving accessibility of digitization outputs: EODOPEN project research findings’, Digital Library Perspectives, ahead-of-print. https://doi.org/10.1108/DLP-09-2023-0080.

Li, H. and Zhang, J. (March 2024) ‘Information Extraction for Semantic Enrichment of BIM for Bridges’, Construction Research Congress 2024, pp. 629–638. https://doi.org/10.1061/9780784485262.064.

Hovious, A. and Wang, C. (March 2024) ‘Hidden Inequities of Access: Document Accessibility in an Aggregated Database’, Information Technology and Libraries, 43(1), p. 12. https://doi.org/10.5860/ital.v43i1.16661.

Madhav, D. et al. (2024) ‘Question Generation from PDF using LangChain’, in 2024 11th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India: IEEE, pp. 218–222. https://doi.org/10.23919/INDIACom61295.2024.10499105.

Nale, D.D. and Pise, A.C. (Feb. 2024) ‘Smart Result Analysis-pdf Extractor’, in 2024 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), pp. 1–6. https://doi.org/10.1109/SCEECS61402.2024.10482279.

Peña, A. et al. (2024) ‘Continuous document layout analysis: Human-in-the-loop AI-based data curation, database, and evaluation in the domain of public affairs’, Information Fusion, p. 102398. https://doi.org/10.1016/j.inffus.2024.102398.

Silvano, W.F. et al. (October 2023) ‘Confidentiality of Electronic Documents in the Civil Registry of Brazil’, p. 15. http://dx.doi.org/10.13140/RG.2.2.32998.23361.

Sontakke, P. et al. (Feb. 2024) ‘PDF-Driven Q&A: A Review’, International Journal for Research in Applied Science and Engineering Technology, 12, pp. 1553–1556. https://doi.org/10.22214/ijraset.2024.58670.

Zhu, H. et al. (2024) ‘A visual analysis approach for data transformation via domain knowledge and intelligent models’, Multimedia Systems, 30(3), p. 126. https://doi.org/10.1007/s00530-024-01331-x.
