AI pirates and more from PDF in the Wild!

April 15, 2025
AI’s Pirated-Books Problem | Correcting US financial regulations referencing PDF | New crawl limit improves CC’s capture of PDF | W3C note on “Accessibility of machine learning and generative AI” | What is so plain about plain text? | Surveying PDF accessibility remediators | Courts’ rules lack detail | PDFacademicBot for April 2025
PDF Association staff
The PDF Association staff delivers a vendor-neutral platform in service of PDF’s stakeholders.


“AI’s Pirated-Books Problem”

The issue of data rights management when dealing with AI’s huge appetite for data is a very hot topic, as The Atlantic reports in their recent article “The Unbelievable Scale of AI’s Pirated-Books Problem”. Open Future’s 2024 report on “A Vocabulary for opting out of AI Training and Other forms of TDM” discusses the broader range of systems to address these issues.

A great deal of the content devoured by AI is in PDF format. It seems, therefore, timely to point out that adding W3C’s TDMRep (Text and Data Mining Reservation) to a PDF’s XMP metadata is a simple step authors and publishers can take to “express the reservation of rights relative to text & data mining (TDM) applied to lawfully accessible Web content, and to ease the discovery of TDM licensing policies associated with such content”.

Authors and publishers concerned by AI mining should include consideration of TDMRep in their regular workflow for new publications. TDMRep can also be retroactively applied to existing PDF documents by using an incremental update of the XMP metadata. Speak to your preferred PDF vendor for more information.

The PDF Association also provides a free XMP Extension Schema Template for TDMRep for inclusion in PDF/A-1, PDF/A-2, and PDF/A-3 files.
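As a rough illustration of what such metadata might look like, the following Python sketch builds the RDF fragment that would sit inside a PDF's XMP packet. It assumes the TDMRep namespace is `http://www.w3.org/ns/tdmrep#` and that the properties are named `tdm:reservation` and `tdm:policy`; check both against the W3C TDMRep report before relying on them. The function name and the policy URL are hypothetical, and writing the packet into a PDF (for example, via an incremental update) is left to your PDF tooling.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The RDF namespace is standard. The TDMRep namespace and property names
# are assumptions to verify against the W3C TDMRep report.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TDM_NS = "http://www.w3.org/ns/tdmrep#"


def build_tdmrep_rdf(reserved: bool, policy_url: Optional[str] = None) -> str:
    """Return an rdf:RDF fragment expressing a TDM reservation (hypothetical helper)."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("tdm", TDM_NS)

    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )

    # "1" means TDM rights are reserved, "0" means they are not.
    ET.SubElement(desc, f"{{{TDM_NS}}}reservation").text = "1" if reserved else "0"

    # A policy URL is only emitted when rights are actually reserved.
    if reserved and policy_url:
        ET.SubElement(desc, f"{{{TDM_NS}}}policy").text = policy_url

    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # Placeholder policy URL for illustration only.
    print(build_tdmrep_rdf(True, "https://example.com/tdm-policy.json"))
```

The resulting fragment would still need to be wrapped in a complete XMP packet and, for PDF/A files, accompanied by the matching extension schema description.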

Correcting US financial regulations referencing PDF

The PDF Association filed a comment on a proposed new rule in the US Financial Data Transparency Act Joint Data Standard, seeking to correct and clarify its references to PDF technology.

Specifically, we noted that compliance with ISO 19005 (PDF/A) helps to ensure that the recordkeeping requirements of 44 U.S.C. 3502(13) are achieved, but cannot, by itself, ensure the inclusion of embedded “schema and taxonomy data” to “provide necessary metadata that allows for automated data extraction”.

This Proposed Rule was brought to our attention by developers working with XBRL International, as they research methods to include inline XBRL data in PDF documents to meet this proposed rule.

New crawl limit improves CC’s capture of PDF

As reported by the JPL SafeDocs research team in 2020, Common Crawl’s previous web crawls truncated collected files at 1MB, resulting in 22% (430,000 PDFs!) of the December 2019 crawl being invalidated.

In its announcement of the March 2025 crawl, Common Crawl noted an increase in this limit to 5MB. As a result, Common Crawl reports fetching 13% more content, including reduced truncation of PDF files, now down to just 6.8%.

As an active participant in the SafeDocs research program, we’d like to think we influenced Common Crawl to make this change so that more PDF documents are reliably preserved!
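To make the truncation problem concrete, here is a minimal Python sketch of a heuristic for spotting a PDF that was probably cut off at a crawl size limit. It relies on the fact that a complete PDF normally ends with a `%%EOF` marker, and assumes that looking in the last 1024 bytes is a sufficiently tolerant check. The function name and thresholds are hypothetical; this is not how Common Crawl or SafeDocs measured truncation.

```python
def looks_truncated(data: bytes, crawl_limit: int = 1024 * 1024) -> bool:
    """Heuristically flag a fetched PDF as cut off at the crawl limit.

    A file is flagged when it is at least as large as the limit AND has
    no %%EOF marker near its end. Both conditions are assumptions chosen
    for illustration; real-world PDFs can be damaged in other ways.
    """
    hit_limit = len(data) >= crawl_limit
    has_eof_marker = b"%%EOF" in data[-1024:]
    return hit_limit and not has_eof_marker


if __name__ == "__main__":
    # Tiny synthetic "files" with a 64-byte limit, purely for demonstration.
    limit = 64
    complete = b"%PDF-1.7\n" + b"x" * 20 + b"\n%%EOF\n"
    cut_off = (b"%PDF-1.7\n" + b"x" * 200 + b"\n%%EOF\n")[:limit]
    print(looks_truncated(complete, limit), looks_truncated(cut_off, limit))
```

A check like this only finds likely truncation; confirming that a file is actually usable still requires parsing it with a PDF processor.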

W3C note on “Accessibility of machine learning and generative AI”

In a new draft note titled “Accessibility of machine learning and generative AI”, the W3C has begun assessing how machine learning and generative AI can impact, both positively and negatively, the accessibility of web content.

With many PDF applications now including AI capabilities, it will be interesting to see where this work leads since accessibility is primarily concerned with content, irrespective of its format.

What is so plain about plain text?

Ross Spencer’s recent blog post provides an examination of significant properties and the importance of context even to something as plain as plain text files:

Take for example plain-text … – if I write this blog in plain-text we have a document, or record, something recognizable as the written language, literally, it’s a blog. Yet, if I use this same foundational “file format” and instead write:

#! /usr/bin/bash
set -eux
echo "hello world"

I have a shell script that’s going to be evaluated as at least three commands in a Linux environment.

The post highlights that understanding the intent and context of the preservation needs of “plain text” files is far more complex and multi-faceted than many realize, especially since many plain text files are themselves technical file formats (XML, JSON, YAML, etc.).
When considering other, more complex file formats, such as PDF, some previous work has recommended an approach that focuses on a simultaneous awareness of specification, implementation, and situation to provide complete context.

As Spencer notes at the end of his post:

Context as we know, is everything to the archive and the archivist. Without it, you just have something, but you don’t know what.

We have always highlighted that the use of PDF/A should occur within an appropriate policy and procedural framework. The respective ISO documents addressing PDF only provide the specification pillar. Validators and forensic tools can help address implementation questions, but other aspects are, inevitably, matters of governance.

If you’re interested in similar discussions related to the preservation of textual content, then you may be interested in ISO 20271, which sets out to establish a format-agnostic reference model for long-term preservation of textual content. This standard is in development alongside PDF standards in ISO TC 171 SC 2 WG 10; please join us via our DocRM LWG community if not the ISO WG itself.

Surveying PDF accessibility remediators

Remediate PDFs for accessibility? You should respond to Karlen Communications’ 3rd annual research survey specifically for PDF Remediators!

The purpose of the research survey is to:

  • Provide feedback on the accessibility of the user interface of PDF remediation software.
  • Provide feedback on the tools that work and those that don’t in the “ease of PDF remediation” process.
  • Give a voice to those creating, remediating, and fixing tagged accessible PDFs.

This survey closes May 2, 2025. Raw data will be published on the research page at the Karlen Communications website.

Courts' rules lack detail

In this case, the court reminds an attorney appearing before it of the court’s local rules, which require submissions to be “in a searchable Portable Document Format (PDF)”, and warns that future submissions that fail in this regard “...may not be considered.”

“Searchability” is not a formally defined term for PDF because content comes in many forms (e.g., scanned pages, photographic images) and is dependent on software as to what PDF features are searched (e.g., annotations, metadata, embedded files, etc.).

PDF/A (ISO 19005), developed with input from the US Courts since 2001, ensures that conforming documents will be equivalently discoverable and presentable over time. However, PDF/A, as a file format, does not guarantee that the PDF is “searchable” since the presence of text content cannot be mandated. Although the starting point for such deliverables should always be compliance with PDF/A, the complexities of real-world content mean that additional instructions are necessary to ensure text searchability appropriate for the court’s purposes.

There are other areas in which courts could improve their references to PDF, as in this case from 2024. It’s not “Adobe Acrobat portable document format” anymore – PDF became an open standard in 2008!

PDFacademicBot for April 2025

Abedrapo Rosen, I., Hartley Belmar, R. and Sánchez Núñez, P. (March 2025) ‘Open Science governance: the role of persistent identifiers and metadata standards’. Preprint. Open Science Framework. https://doi.org/10.31219/osf.io/9h564_v1.

Chenyu Mao (February 2025) Enhancing Layout Understanding via Human-in-the-Loop: A User Study on PDF-to-HTML Conversion for Long Documents. Master of Science, Computer Science and Applications. Virginia Polytechnic Institute and State University. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/6e305f44-f49a-4e76-8c0c-12a7c2e3cfd0/content.

Fang, L. et al. (April 2025) ‘uCite: The union of nine large-scale public PubMed citation datasets with reliability filtering’, Data in Brief, p. 111535. https://doi.org/10.1016/j.dib.2025.111535.

Gawli, P. et al. (2025) ‘DocGenie: A Retrieval-Augmented Generation Chatbot for Interactive PDF Exploration’, in 2025 International Conference on Electronics and Renewable Systems (ICEARS), Tuticorin, India: IEEE, pp. 659–664. https://doi.org/10.1109/ICEARS64219.2025.10940621.

Karthik Jayanthi, D.M.V.S.V. et al. (2025) ‘Document Summariser & Questiongenerator Using LLMS’, in 2025 First International Conference on Advances in Computer Science, Electrical, Electronics, and Communication Technologies (CE2CT), Bhimtal, Nainital, India: IEEE, pp. 208–213. https://doi.org/10.1109/CE2CT64011.2025.10939642.

Kawde, D. and Mendhe, S. (February 2025) ‘The Power of PDF-to-Audio Summaries in Modern Learning’, in 2025 4th International Conference on Sentiment Analysis and Deep Learning (ICSADL), pp. 245–249. https://doi.org/10.1109/ICSADL65848.2025.10933387.

Li, X., Bianculli, D. and Briand, L. (April 2025) ‘Tracing content requirements in financial documents using multi-granularity text analysis’, Requirements Engineering [Preprint]. https://doi.org/10.1007/s00766-025-00436-7.

Schmitt-Koopmann, F.M. et al. (March 2025) ‘Towards More Accessible Scientific PDFs for People with Visual Impairments: Step-by-Step PDF Remediation to Improve Tag Accuracy’. Preprint. http://arxiv.org/abs/2503.22216.

Stahl, C.G. et al. (April 2025) ‘PDF Entity Annotation Tool (PEAT)’, Journal of Open Source Software, 10(108), p. 5336. https://doi.org/10.21105/joss.05336.

Wang, S. et al. (May 2025) ‘Construction regulatory document digitalization with layout knowledge-informed object detection and semantic text recognition’, Advanced Engineering Informatics, 65, p. 103278. https://doi.org/10.1016/j.aei.2025.103278.

Zhang, H. and Shang, J. (2025) ‘Multiformat Document Parsing and Management’, in H. Zhang and J. Shang (eds) Natural Language Processing and Applications. Singapore: Springer Nature, pp. 115–132. https://doi.org/10.1007/978-981-97-9739-4_6.
