New FAQ: AI and PDF

About PDF Association staff

This FAQ is designed to educate journalists, social media commentators and others lacking expertise in PDF technology on why and how AI systems use PDF documents.

PDF Association news, March 31, 2026

AI hasn't figured out all of PDF yet. Our new FAQ is designed to educate everyone on why and how AI systems should understand PDF documents.

Today we're announcing an essential new resource – FAQ: AI and PDF – to provide clarity on the fundamental importance of PDF documents to emerging AI and large language models (LLMs). The FAQ is designed to combat misinformation and educate journalists, social media commentators, and others lacking expertise in PDF technology on why PDFs are critical to AI, how to prepare PDF files for AI, known limitations, and other considerations.

Content in PDF files is highly valued by AI systems because PDFs often serve as persistent electronic documents and as the established “document of record” in human communication. This contrasts sharply with HTML web pages, which are often transactional, short, and subject to change. Citing a recent Hugging Face blog post, the FAQ points out that content in PDF tends to offer higher information density and is inherently long-context.

Despite the intrinsic value of PDF files, processing them is often perceived as difficult for AI. PDF’s nature as a binary file format can make it seem like sorcery compared to text-based formats like Markdown and JSON. This misunderstanding commonly leads to inefficient practices that severely limit AI understanding.

As the FAQ points out, down-converting PDF files to other formats for ingestion, such as plain text or Markdown, is generally a poor strategy. The FAQ warns that this conversion is “inevitably lossy” in terms of rich information and semantics, serving as an unnecessary “dumbing down” process that risks increasing AI hallucinations. For example, converting text with ~~strikeout~~ to plain text discards the semantic significance of the markup, and with it the intended meaning.
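The strikeout example can be made concrete with a minimal Python sketch. The `Span` type and both helper functions are hypothetical illustrations, not part of any PDF library or of the FAQ itself; a real ingestion pipeline would read text markup annotations from the PDF file rather than from a hand-built list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A run of text plus one piece of markup semantics (hypothetical model)."""
    text: str
    struck_out: bool = False


def to_plain_text(spans):
    """Down-convert: keep only the characters and drop all markup."""
    return " ".join(span.text for span in spans)


def to_semantic_text(spans):
    """Keep the markup by labelling struck-out runs explicitly."""
    parts = []
    for span in spans:
        parts.append(f"[deleted: {span.text}]" if span.struck_out else span.text)
    return " ".join(parts)


# An invented contract clause in which the old fee has been struck out.
clause = [
    Span("The fee is"),
    Span("$500", struck_out=True),
    Span("$750 per month."),
]

plain = to_plain_text(clause)
semantic = to_semantic_text(clause)
print(plain)     # The fee is $500 $750 per month.
print(semantic)  # The fee is [deleted: $500] $750 per month.
```

In the plain-text output, nothing tells a reader (or a model) which of the two amounts is still in force. The second output preserves that distinction, which is the kind of information the FAQ says is lost in down-conversion.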

The key to optimal AI ingestion, the FAQ stresses, is leveraging all of PDF's inherent features, especially PDF’s semantic capabilities. Tagged PDF documents provide rich semantic information, including logical reading order, natural language indicators, table structure, and alt-text for images, all of which can help AI to understand a document's structure while minimizing computational costs. These tags provide the document’s unpaginated logical structure, helping AI systems to completely avoid the need to understand pagination artifacts.

For optimal understanding, AI ingestion systems must ingest all components, including annotations (such as text markup, digital signatures, and multimedia) and rich XMP metadata; ignoring this information reduces overall understanding and increases the risk of hallucinations.

The resource also addresses the rapidly evolving landscape of copyright and AI training. It guides publishers on indicating their rights and preferences for Text and Data Mining (TDM), including opting out of training, for example by including XMP metadata in accordance with the W3C’s TDMRep protocol for PDF. The PDF Association continues to work alongside industry, publishers, and regulators to ensure that PDF can encapsulate various methods for expressing TDM rights.
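As a rough illustration of how an ingestion system might check such a preference, the Python sketch below reads TDM properties from an XMP packet using only the standard library. The namespace URI `http://www.w3.org/ns/tdmrep#`, the property names `reservation` and `policy`, the meaning of the value `1`, and the sample packet are all assumptions made for illustration; they should be verified against the TDMRep specification before use. Extracting the XMP packet from an actual PDF file is out of scope here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed TDMRep namespace URI; verify against the TDMRep specification.
TDM_NS = "http://www.w3.org/ns/tdmrep#"

# Hypothetical XMP packet of the kind a publisher might embed in a PDF.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:tdm="http://www.w3.org/ns/tdmrep#">
      <tdm:reservation>1</tdm:reservation>
      <tdm:policy>https://example.com/tdm-policy.json</tdm:policy>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_tdm_preferences(xmp_packet):
    """Return the assumed TDM properties found in an XMP packet string.

    Only properties written as child elements are handled; XMP also allows
    an attribute form, which this sketch ignores.
    """
    root = ET.fromstring(xmp_packet)
    prefs = {"reservation": None, "policy": None}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for name in ("reservation", "policy"):
            element = description.find(f"{{{TDM_NS}}}{name}")
            if element is not None and element.text:
                prefs[name] = element.text.strip()
    return prefs


prefs = read_tdm_preferences(SAMPLE_XMP)
# Assumption: a reservation value of "1" signals that TDM rights are reserved.
rights_reserved = prefs["reservation"] == "1"
print(prefs, rights_reserved)
```

A system that finds no such properties cannot conclude anything about the publisher's preferences from this check alone, since rights may be expressed by other means.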

FAQ: AI and PDF clarifies how AI should handle PDF documents to ensure accuracy and prevent common errors. The FAQ includes a public feedback mechanism inviting new questions or requests for clarification.

A live webinar introducing the FAQ to analysts, journalists, commentators, and policy-makers – and allowing for extended Q&A – will be announced shortly.
