When will AI understand document semantics?

The PDF Association staff delivers a vendor-neutral platform in service of PDF’s stakeholders.


Progress on HDR standards
In the last week of February, ISO TC 42 WG 23 met in Japan to progress discussions on High Dynamic Range (HDR). Work items included:
- ISO/DIS 21496-1 Digital Photography — Gain map metadata for image conversionPart 1: Dynamic Range Conversion
- ISO/TS 22028-5:2023 Photography and graphic technology — Extended colour encodings for digital image storage, manipulation and interchange — Part 5: High dynamic range and wide colour gamut encoding for still images (HDR/WCG)
- ISO/PWI 25092 - HDR for still images - Best practices
- Work related to clarifying HDR terminology
Members of the PDF Association may access all ISO documents via the Imaging Model TWG's page (login required) - see the link under Technical Resources.
Following TC 46 WG 23 the International Color Consortium met, dedicating March 3 to “ICC HDR Stills Experts Day” at Chiba University.
Indicating whether the publisher allows PDF files to be mined by AI
The W3C has defined a “TDM Reservation Protocol (TDMRep)” to express the reservation of rights relative to text and data mining (TDM) applied to lawfully accessible content and to ease the discovery of TDM licensing policies associated with such content. This includes information on how to include TDRep in HTML, EPUB 2, EPUB 3, and, of course, PDF files.
In PDF files, TDRep is defined using the document catalog XMP metadata stream:
<rdf:RDF xmlns:rdf=http://www.w3.org/1999/02/22-rdf-syntax-ns#> <rdf:Description rdf:about="" ... xmlns:tdm="http://www.w3.org/ns/tdmrep/"> <tdm:reservation>1</tdm:reservation> <tdm:policy>https://publisher.com/policies/policy.json</tdm:policy> </rdf:Description> </rdf:RDF>
Copied from there W3C final report, example 8.
The PDF Association wishes to clarify that the TDMRep properties in the document catalog XMP metadata apply to all pages and all resources in a PDF file (including images, attachments, associated files, etc.). We provide a free XMP Extension Schema template enabling PDF/A-1, PDF/A-2 and PDF/A-3 conforming documents to express TDM rights.
Will LLM-based OCR live up to the challenge?
This recent article in Ars Technica discusses whether the new trend of LLM-based OCR and vision-based LLMs will displace traditional OCR techniques, with the assessment “these promotional claims don't always match real-world performance, according to recent tests.”
Although results show some improved understanding of complex document layout, hallucinations and other issues undermine reliability of the resultant text.
When will AI understand document semantics?
We recently crossed paths with this significant GitHub pull request on LangChain-AI which reiterates what we have been asserting all along:
Rational
Even though Document has a page_content parameter (rather than text or body), we believe it’s not good practice to work with pages. Indeed, this approach creates memory gaps in RAG projects. If a paragraph spans two pages, the beginning of the paragraph is at the end of one page, while the rest is at the start of the next. With a page-based approach, there will be two separate chunks, each containing part of a sentence. The corresponding vectors won’t be relevant. These chunks are unlikely to be selected when there’s a question specifically about the split paragraph. If one of the chunks is selected, there’s little chance the LLM can answer the question. This issue is worsened by the injection of headers, footers (if parsers haven’t properly removed them), images, or tables at the end of a page, as most current implementations tend to do.
An example of AI confusion is this GIGO story in which proper document semantics would clearly have avoided the issue through logical reading order in column order! The original PDF comprises full page scanned images with an invisible text layer (using text render mode 3) of a 1959 article. The OCR engine used to create the invisible text layer has not recognized the columnar layout and has rendered the invisible text in sequence. AI has then simply seen the text and blindly processed it in rendering order.
Let’s hope the LangChain-AI developers can finally bring rich document semantics to AI!
JHOVE PDF parsing improvements
The Open Preservation Foundation has posted a blog about improvements in PDF parsing in JHOVE 1.34. These improvements include recognising PDF 2.0, correctly processing PDF strings with balanced and unbalanced parentheses, parsing PDF date strings, and checking object numbers against the trailer size. We are glad to see these improvements, as they help preservation organizations ensure the validity of their PDF documents.
PDFacademicBot for March 2025
C. Akshitha et al. (2025) ‘Enhancing Question Answer Generation from PDFs: A Fusion of BERT, RAKE, T5 and DistilBERT with RQUGE Evaluation’, in Proceedings of 5th International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. 1st edn. Singapore: Springer Nature, pp. 249–260. https://books.google.com.au/books?id=kr1KEQAAQBAJ.
Alam, K. et al. (2025) ‘Co-Pilot for Project Managers: Developing a PDF-Driven AI Chatbot for Facilitating Project Management’, IEEE Access, pp. 1–1. https://doi.org/10.1109/ACCESS.2025.3548519.
Arshi, M., Pathi, N.K. and Agarwal, R. (Feb. 2025) ‘PDF guard - Advanced malicious PDF detection tool’, AIP Conference Proceedings, 3237(1), p. 030051. https://doi.org/10.1063/5.0247140.
Deng, J. et al. (March 2025) ‘An automatic selective PDF table-extraction method for collecting materials data from literature’, Advances in Engineering Software, 204, p. 103897. https://doi.org/10.1016/j.advengsoft.2025.103897.
Esther Rajakumari, K. et al. (2025) ‘AI Doc Question and Answering System’, in 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI). 2025 International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), Erode, India: IEEE, pp. 94–99. https://doi.org/10.1109/ICMSCI62561.2025.10894448.
Iannuzzi, N. et al. (Feb. 2025) ‘Combined accessibility validation and monitoring of web sites and PDF documents’, Universal Access in the Information Society [Preprint]. https://doi.org/10.1007/s10209-025-01194-7.
Martin, G.L. et al. (2025) ‘Deep Defense Against Mal-Doc: Utilizing Transformer and SeqGAN for Detecting and Classifying Document Type Malware’, Applied Sciences, 15(6), p. 2978. https://doi.org/10.3390/app15062978.
Mayur Bhoyar et al. (2025) ‘InferSync TextLens: Cognitive PDF Comparison Information Technology’, in 6G Communications Networking and Signal Processing: Proceedings of the International Conference, SGCNSP 2023. Singapore: Springer Nature Singapore, pp. 323–334. https://books.google.com.au/books?hl=en&lr=lang_en&id=bnxGEQAAQBAJ&oi=fnd&pg=PA323&dq=PDF&ots=svqoGAYXjT&sig=J7mv1ejovF0JEChip--sKbsDEkg#v=onepage&q=PDF&f=false.
Nivetha, K.K. and Priyasadini, S. (Feb. 2025) ‘Annotate Ease: PDF Metadata Extraction Application Specializing in Research Publications’. Graduate Colloquium 2025, Sabaragamuwa University of Sri Lanka: Faculty of Computing. http://repo.lib.sab.ac.lk:8080/xmlui/handle/susl/4868
Penchev, B. and Todoranova, L. (Feb. 2025) ‘Accessibility of Electronic Resources for Students with Disabilities’, Acta Educationis Generalis, 15(1), pp. 19–30. https://doi.org/10.2478/atd-2025-0002.
R, Parvathi et al. (2025) ‘Rich Text Extraction and Classification From PDF Documents Using Deep Learning Techniques’, in Optimization, Machine Learning, and Fuzzy Logic: Theory, Algorithms, and Applications. IGI Global Scientific Publishing, pp. 383–402. https://doi.org/10.4018/979-8-3693-7352-1.ch016.
Şahin, Ş., Duymaz, Y.K. and Bahşi, İ. (Feb. 2025) ‘Expanding Perspectives on Three-dimensional Portable Document Format in Craniofacial Surgery’, Journal of Craniofacial Surgery, p. 10.1097/SCS.0000000000011183. https://doi.org/10.1097/SCS.0000000000011183.
Sahan, Z.M. and Abdulrahman, A.A. (March 2025) ‘Malware detection of PDF documents based on machine learning techniques (A review)’, AIP Conference Proceedings, 3264(1), p. 030036. https://doi.org/10.1063/5.0259404.
Michael Scholz and Jörg Bauer (2025) ‘OTE: A Tool for Extracting Tabular Purchase Order Information from PDF Documents’, in Information and Software Technologies. Google Books, pp. 161-172. https://books.google.com.au/books?hl=en&lr=lang_en&id=ui9NEQAAQBAJ&oi=fnd&pg=PA161&dq=PDF&ots=elmm5yprHI&sig=5NuiU_nS33M0khDOCoHnHiZgGpI#v=onepage&q=PDF&f=false.
Sheridan, G. (Dec. 2024) ‘Access All Areas: Breaking Down the Barriers to Legal Information’, Legal Information Management, 24(4), pp. 267–271. https://doi.org/10.1017/S1472669624000598.