Featured articles

Image of PDF 2.0 Errata Collection 3 now available

PDF 2.0 Errata Collection 3 now available

Declaring WCAG compliance in PDF

PDF 2.0: New Features, Real-World Impact

Discover pdfa.org

Review publications from the PDF Association and ISO
The technical index lists critical resources for developers
Learn which companies are PDF Association members
Review hundreds of presentations from our events

Key resources

Try the new VS Code extension for PDF syntax
Check out our cheat sheets for PDF developers
Get PDF’s latest specification, ISO 32000-2 at no cost
Add ISO 32000-2 errata via our public GitHub repo, and check out the resolutions

Get involved

Discover the benefits of PDF Association membership
Join the PDF Association!
Review the PDF technical community’s working groups

How do you find the right PDF technology vendor?
Use the Solution Agent to ask the entire PDF communuity!

The PDF Association celebrates its members’ public statements
of support for ISO-standardized PDF technology.

Technical resources

Technical resources

PDF Association on pdfa.org

Related company page

Become a Member!

Become a Member!

When will AI understand document semantics?

About PDF Association staff

When will AI understand document semantics? | Will LLM-based OCR live up to the challenge? | Indicating whether the publisher allows PDF files to be mined by AI | JHOVE PDF parsing improvements | Progress on HDR standards | PDFacademicBot for March, 2025.

PDF in the WildMarch 17, 2025

On the left, a screen-shot of the PDF page, on the right the PDF code showing

When will AI understand document semantics?

On the left, a screen-shot of the PDF page, on the right the PDF code showing

When will AI understand document semantics? | Will LLM-based OCR live up to the challenge? | Indicating whether the publisher allows PDF files to be mined by AI | JHOVE PDF parsing improvements | Progress on HDR standards | PDFacademicBot for March, 2025.

PDF in the WildMarch 17, 2025

About PDF Association staff

Progress on HDR standards

In the last week of February, ISO TC 42 WG 23 met in Japan to progress discussions on High Dynamic Range (HDR). Work items included:

ISO/DIS 21496-1 Digital Photography — Gain map metadata for image conversionPart 1: Dynamic Range Conversion
ISO/TS 22028-5:2023 Photography and graphic technology — Extended colour encodings for digital image storage, manipulation and interchange — Part 5: High dynamic range and wide colour gamut encoding for still images (HDR/WCG)
ISO/PWI 25092 - HDR for still images - Best practices
Work related to clarifying HDR terminology

Members of the PDF Association may access all ISO documents via the Imaging Model TWG's page (login required) - see the link under Technical Resources.

Following TC 46 WG 23 the International Color Consortium met, dedicating March 3 to “ICC HDR Stills Experts Day” at Chiba University.

Indicating whether the publisher allows PDF files to be mined by AI

The W3C has defined a “TDM Reservation Protocol (TDMRep)” to express the reservation of rights relative to text and data mining (TDM) applied to lawfully accessible content and to ease the discovery of TDM licensing policies associated with such content. This includes information on how to include TDRep in HTML, EPUB 2, EPUB 3, and, of course, PDF files.

In PDF files, TDRep is defined using the document catalog XMP metadata stream:

<rdf:RDF xmlns:rdf=http://www.w3.org/1999/02/22-rdf-syntax-ns#>
<rdf:Description rdf:about=""
  ...
  xmlns:tdm="http://www.w3.org/ns/tdmrep/">
      <tdm:reservation>1</tdm:reservation>
      <tdm:policy>https://publisher.com/policies/policy.json</tdm:policy>
    </rdf:Description>
  </rdf:RDF>

Copied from there W3C final report, example 8.

The PDF Association wishes to clarify that the TDMRep properties in the document catalog XMP metadata apply to all pages and all resources in a PDF file (including images, attachments, associated files, etc.). We provide a free XMP Extension Schema template enabling PDF/A-1, PDF/A-2 and PDF/A-3 conforming documents to express TDM rights.

Will LLM-based OCR live up to the challenge?

This recent article in Ars Technica discusses whether the new trend of LLM-based OCR and vision-based LLMs will displace traditional OCR techniques, with the assessment “these promotional claims don't always match real-world performance, according to recent tests.”

Although results show some improved understanding of complex document layout, hallucinations and other issues undermine reliability of the resultant text.

When will AI understand document semantics?

We recently crossed paths with this significant GitHub pull request on LangChain-AI which reiterates what we have been asserting all along:

Rational
Even though Document has a page_content parameter (rather than text or body), we believe it’s not good practice to work with pages. Indeed, this approach creates memory gaps in RAG projects. If a paragraph spans two pages, the beginning of the paragraph is at the end of one page, while the rest is at the start of the next. With a page-based approach, there will be two separate chunks, each containing part of a sentence. The corresponding vectors won’t be relevant. These chunks are unlikely to be selected when there’s a question specifically about the split paragraph. If one of the chunks is selected, there’s little chance the LLM can answer the question. This issue is worsened by the injection of headers, footers (if parsers haven’t properly removed them), images, or tables at the end of a page, as most current implementations tend to do.

An example of AI confusion is this GIGO story in which proper document semantics would clearly have avoided the issue through logical reading order in column order! The original PDF comprises full page scanned images with an invisible text layer (using text render mode 3) of a 1959 article. The OCR engine used to create the invisible text layer has not recognized the columnar layout and has rendered the invisible text in sequence. AI has then simply seen the text and blindly processed it in rendering order.

On the left, a screen-shot of the PDF page, on the right the PDF code showing "vegetative electron microscopy" in order.

Let’s hope the LangChain-AI developers can finally bring rich document semantics to AI!

JHOVE PDF parsing improvements

The Open Preservation Foundation has posted a blog about improvements in PDF parsing in JHOVE 1.34. These improvements include recognising PDF 2.0, correctly processing PDF strings with balanced and unbalanced parentheses, parsing PDF date strings, and checking object numbers against the trailer size. We are glad to see these improvements, as they help preservation organizations ensure the validity of their PDF documents.

PDFacademicBot for March 2025

C. Akshitha et al. (2025) ‘Enhancing Question Answer Generation from PDFs: A Fusion of BERT, RAKE, T5 and DistilBERT with RQUGE Evaluation’, in Proceedings of 5th International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. 1st edn. Singapore: Springer Nature, pp. 249–260. https://books.google.com.au/books?id=kr1KEQAAQBAJ.

Alam, K. et al. (2025) ‘Co-Pilot for Project Managers: Developing a PDF-Driven AI Chatbot for Facilitating Project Management’, IEEE Access, pp. 1–1. https://doi.org/10.1109/ACCESS.2025.3548519.

Arshi, M., Pathi, N.K. and Agarwal, R. (Feb. 2025) ‘PDF guard - Advanced malicious PDF detection tool’, AIP Conference Proceedings, 3237(1), p. 030051. https://doi.org/10.1063/5.0247140.

Deng, J. et al. (March 2025) ‘An automatic selective PDF table-extraction method for collecting materials data from literature’, Advances in Engineering Software, 204, p. 103897. https://doi.org/10.1016/j.advengsoft.2025.103897.

Esther Rajakumari, K. et al. (2025) ‘AI Doc Question and Answering System’, in 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI). 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), Erode, India: IEEE, pp. 94–99. https://doi.org/10.1109/ICMSCI62561.2025.10894448.

Iannuzzi, N. et al. (Feb. 2025) ‘Combined accessibility validation and monitoring of web sites and PDF documents’, Universal Access in the Information Society [Preprint]. https://doi.org/10.1007/s10209-025-01194-7.

Martin, G.L. et al. (2025) ‘Deep Defense Against Mal-Doc: Utilizing Transformer and SeqGAN for Detecting and Classifying Document Type Malware’, Applied Sciences, 15(6), p. 2978. https://doi.org/10.3390/app15062978.

Mayur Bhoyar et al. (2025) ‘InferSync TextLens: Cognitive PDF Comparison Information Technology’, in 6G Communications Networking and Signal Processing: Proceedings of the International Conference, SGCNSP 2023. Singapore: Springer Nature Singapore, pp. 323–334. https://books.google.com.au/books?hl=en&lr=lang_en&id=bnxGEQAAQBAJ&oi=fnd&pg=PA323&dq=PDF&ots=svqoGAYXjT&sig=J7mv1ejovF0JEChip--sKbsDEkg#v=onepage&q=PDF&f=false.

Nivetha, K.K. and Priyasadini, S. (Feb. 2025) ‘Annotate Ease: PDF Metadata Extraction Application Specializing in Research Publications’. Graduate Colloquium 2025, Sabaragamuwa University of Sri Lanka: Faculty of Computing. http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4868

Penchev, B. and Todoranova, L. (Feb. 2025) ‘Accessibility of Electronic Resources for Students with Disabilities’, Acta Educationis Generalis, 15(1), pp. 19–30. https://doi.org/10.2478/atd-2025-0002.

R, Parvathi et al. (2025) ‘Rich Text Extraction and Classification From PDF Documents Using Deep Learning Techniques’, in Optimization, Machine Learning, and Fuzzy Logic: Theory, Algorithms, and Applications. IGI Global Scientific Publishing, pp. 383–402. https://doi.org/10.4018/979-8-3693-7352-1.ch016.

Şahin, Ş., Duymaz, Y.K. and Bahşi, İ. (Feb. 2025) ‘Expanding Perspectives on Three-dimensional Portable Document Format in Craniofacial Surgery’, Journal of Craniofacial Surgery, p. 10.1097/SCS.0000000000011183. https://doi.org/10.1097/SCS.0000000000011183.

Sahan, Z.M. and Abdulrahman, A.A. (March 2025) ‘Malware detection of PDF documents based on machine learning techniques (A review)’, AIP Conference Proceedings, 3264(1), p. 030036. https://doi.org/10.1063/5.0259404.

Michael Scholz and Jörg Bauer (2025) ‘OTE: A Tool for Extracting Tabular Purchase Order Information from PDF Documents’, in Information and Software Technologies. Google Books, pp. 161-172. https://books.google.com.au/books?hl=en&lr=lang_en&id=ui9NEQAAQBAJ&oi=fnd&pg=PA161&dq=PDF&ots=elmm5yprHI&sig=5NuiU_nS33M0khDOCoHnHiZgGpI#v=onepage&q=PDF&f=false.

Sheridan, G. (Dec. 2024) ‘Access All Areas: Breaking Down the Barriers to Legal Information’, Legal Information Management, 24(4), pp. 267–271. https://doi.org/10.1017/S1472669624000598.

Print Friendly, PDF & Email

WordPress Cookie Notice by Real Cookie Banner