Thin support for PDF in AI applications makes for a thin business model

About PDF Association staff

LibreOffice now supports PDF/UA! AI LLMs’ support for PDF is substandard, and is retarding innovation. Yet more misinformation about malware. PDFacademicBot for January, 2024.

PDF in the Wild, January 26, 2024


LibreOffice 7.6 ships with new support for PDF/UA

In the latest release (December 7, 2023) of the venerable Office alternative, Tagged PDF is now the default when exporting to PDF, and there is an option to produce PDF/UA-conforming files!

Thin support for PDF in AI applications makes for a thin business model

OpenAI added support for PDF - not very good support, as ChatGPT admits - but it exists. As a result, other companies are now complaining that ChatGPT’s publisher has made it harder for them to provide similar - maybe better - support for PDF.

Our view? If you aren’t going to do a good job supporting PDF, leave room for others who might!

Misinformation about malware

While phishing attacks leveraging PDF files are growing increasingly sophisticated, TechHQ produces clickbait that helps drive a common misperception… and then recommends that users replace PDF files with… PostScript!

Of course, it’s true that PDF files are an excellent vehicle for phishing attacks… not because PDF is a dangerous file format but because users may believe a PDF is authentic long enough to click on a link - a problem that PostScript files solve only by virtue of the fact that they can't include links at all! And that's before we start in on all the other reasons why PostScript isn't more secure.

The solution - as always - is to approach EVERY file with care, not just PDF files.

PDFacademicBot for January, 2024

Crichton, W. and Krishnamurthi, S. (2024) ‘A Core Calculus for Documents: Or, Lambda: The Ultimate Document’, in ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2024. Available at: https://cs.brown.edu/~sk/Publications/Papers/Published/ck-core-calc-doc-lambda-ultimate-doc/

NOTE: This paper was especially recommended by DARPA’s SafeDocs Program Manager, Dr. Sergey Bratus, with the comment: “This may be a keystone academic paper for our Data Definition Languages efforts on SafeDocs, and particularly the chain-of-trust models for documents”.

Li, J. et al. (2023) ‘Design and implementation of an automated PDF drawing statistics tool based on Python’, in Sixth International Conference on Computer Information Science and Application Technology (CISAT 2023), SPIE, pp. 1686–1690. Available at: https://doi.org/10.1117/12.3003927.

Lindlar, M. (2023) ‘Towards a formalized workflow for format validation error treatment’, in 19th International Conference on Digital Preservation, p. 11. Available at: https://www.ideals.illinois.edu/items/128296.

Shi, Y. et al. (2023) ‘Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation’. arXiv. Available at: https://doi.org/10.48550/arXiv.2310.16809.

“This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich document. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks. Specifically, it showed limitations when dealing with non-Latin languages and complex tasks such as handwriting mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document image. Based on these observations, we affirm the necessity and continued research value of specialized OCR models. In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models. How to fully utilize pre-trained general-purpose LMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at https://github.com/SCUT-DLVCLab/GPT-4V_OCR.”

Omony, P. et al. (2023) Implementation of PDF to Audio Converter for visually impaired students. Thesis. Makerere University. Available at: http://dissertations.mak.ac.ug/handle/20.500.12281/17847.

Eroğlu, F.S. et al. (2023) ‘Effectiveness of using 2D atlas and 3D PDF as a teaching tool in anatomy lectures in initial learners: a randomized controlled trial in a medical school’, BMC Medical Education, 23(1), p. 962. Available at: https://doi.org/10.1186/s12909-023-04960-4.

Detecting Modern Maldocs: An Analytical Study of PDF Feature Importance Using KDD Cup 99 as the Process Framework - ProQuest (2024). Available at: https://www.proquest.com/openview/9d14303f969f105a5660ddbd9f4b64e6/1?pq-origsite=gscholar&cbl=18750&diss=y.

Wang, Y., Li, N. and Tian, Y. (2023) ‘Semantic Metadata Integration Support Method for Editable Re-flowable Document OOXML and Fixed-layout Document PDF’, in Proceedings of the 7th International Conference on Computer Science and Application Engineering. New York, NY, USA: Association for Computing Machinery (CSAE ’23), pp. 1–8. Available at: https://doi.org/10.1145/3627915.3628080.

Vishnu, B.V., Sharath, S.R. and Netravathi, B. (2023) ‘An Interactive Framework for Querying Data from Large PDF Files’, in 2023 International Conference on Recent Advances in Information Technology for Sustainable Development (ICRAIS), pp. 25–30. Available at: https://doi.org/10.1109/ICRAIS59684.2023.10367090.

Abstract: With the evolution of huge language models, it is now possible to converse with robots on any topic. However, unstructured data, such as PDF files, must still be converted into a format that these models can easily analyze. Through this study, we present a framework for creating conversational PDF applications that are powered by language models, specifically OpenAI's GPT-3 architecture. Our framework, enables users to connect their language model APIs to other data sources, such as PDF files, and allows the language model to interact with the data source. We show our framework's usefulness by breaking a big PDF document into smaller chunks, transforming each chunk into embeddings, and constructing a knowledge base that users may query. The results obtained suggest that employing embeddings rather than comparing text directly can greatly reduce data size while improving search result accuracy.
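The chunk, embed, and query pipeline that this abstract describes can be illustrated with a minimal sketch. It assumes the text has already been extracted from the PDF, and it swaps in a toy bag-of-words vector for the learned embeddings the paper uses; every function name here (`chunk_text`, `embed`, `cosine`, `build_index`, `query`) is hypothetical and is not taken from the paper or from any OpenAI API.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text: str, max_words: int = 40) -> list[tuple[str, Counter]]:
    """Build the 'knowledge base': each chunk paired with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(text, max_words)]


def query(index: list[tuple[str, Counter]], question: str, top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

In a real application, `embed` would call an embedding model, and the retrieved chunks would be passed to the language model as context for answering the question.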
