PDF in the Wild for 2024

About PDF Association staff

A close-up look at how PDF does text | What would Patton think? | Word bug blocks PDF/UA validators, Happy Holidays & the PDFacademicBot for December 2024!

PDF in the Wild | December 16, 2024

Happy Holidays from the PDF Association!

In July 2023 the PDF Association introduced "PDF in the Wild", a new monthly feature, including (but not limited to!) PDF in the news, new educational content published elsewhere, interesting tidbits, PDF in popular culture, and of course, the PDFacademicBot, a monthly roundup of academic writings focussing on the PDF format.

Check out the past issues of PDF in the Wild! Do you have a tip about something we should cover? Let us know!

A close-up look at how PDF does text

Jay Berkenbilt, the developer behind the open-source QPDF library and tools, recently completed his five-part series on Medium explaining text in PDF:

Part 1, Text in PDF: Introduction, covers background knowledge such as hexadecimal numbering, Unicode, and the basics of PDF text operators and graphics state.
Part 2, Text in PDF: Basic Operators, uses a simple, hand-coded PDF to display text. It uses one of the built-in fonts and a standard encoding and has examples of the basic text operators.
Part 3, Text in PDF: Unicode, uses a PDF file generated by word processing software (LibreOffice) to demonstrate how arbitrary Unicode characters are represented with font subsets and custom encoding. After this part, you will have all the fundamentals to understand text in just about any PDF file.
Part 4, Text in PDF: Fonts and Spacing, demonstrates the technique of adding color commands to PDF code in a binary-safe text editor to match the code with the rendered file so you can find your way around. It uses the same PDF file as Part 3 to examine the code behind kerning, ligatures, font changes, underlined text, and spacing.
Part 5, Text in PDF: Non-Latin Alphabets, using the same sample PDF, takes a closer look at a mathematical formula, text in non-Latin alphabets with more complex writing systems, and emoji. It concludes with a reflection on how it all works.
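
To give a flavor of what a simple, hand-coded PDF with a built-in font looks like, here is a minimal sketch of our own in Python. It is not taken from the series. The function name build_minimal_pdf, the page size, the text position, and the font resource name /F1 are all assumptions chosen for illustration. The sketch only handles text that fits in Latin-1; anything beyond that needs the font subsets and custom encodings discussed in Part 3.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF that shows `text` in 24 pt Helvetica.

    Illustrative sketch only: names, sizes and positions are assumptions.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")

    # Content stream: BT/ET bracket a text object, Tf selects font and size,
    # Td moves the text position, Tj shows a string.
    content = f"BT\n/F1 24 Tf\n72 720 Td\n({escaped}) Tj\nET\n".encode("latin-1")

    # Objects 1 to 5: catalog, page tree, page, font, content stream.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica "
        b"/Encoding /WinAnsiEncoding >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello, PDF text!"))
```

Opening the resulting hello.pdf in a text editor shows the BT, Tf, Td, Tj and ET operators in the clear, which makes it a handy companion when reading Part 2.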

Season's Greetings from the LaTeX Project!

A Tagged PDF-themed holiday video!

What would General Patton do?

In its blog post entitled “I read your PDF”, Senticore discusses PDF’s role in the 3D space, noting how “...due to the 40-year-old system architecture, PDF data cannot be readily extracted into JSON or XML.”

In the same article, Senticore also highlights that the legendary General Patton “...hardly invented anything; instead, he was able to orchestrate the already matured stack of technologies … in the most creative and consistent manner.”

Indeed, PDF pages are… just pages. But PDF files can both carry structured text and graphics (via Tagged PDF) and contain data files too… as an associated file attachment. So, we ask… if the PDF has value, why not use it as a vehicle for any necessary JSON or XML?
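
As a rough sketch of what carrying JSON inside a PDF involves at the object level, the Python below emits the two objects behind an associated file: an embedded file stream holding the data, and a file specification dictionary that points to it and declares its relationship to the document. The function name embedded_json_objects, the object numbers, and the sample payload are our own illustrative assumptions. A complete file would also need the document catalog to reference the file specification (through its /AF array and the /EmbeddedFiles name tree), which is omitted here.

```python
import json


def embedded_json_objects(filename: str, payload: dict,
                          stream_num: int, spec_num: int) -> tuple:
    """Return (embedded file stream, file specification) as raw PDF objects.

    Illustrative sketch only. Assumes `filename` is plain ASCII with no
    parentheses or backslashes, and that the caller picks free object numbers.
    """
    data = json.dumps(payload, indent=2).encode("utf-8")
    name = filename.encode("ascii")

    # Embedded file stream: the JSON bytes, labelled with their MIME type.
    # A "/" inside a PDF name must be written as #2F.
    stream_obj = (
        b"%d 0 obj\n<< /Type /EmbeddedFile /Subtype /application#2Fjson "
        b"/Length %d >>\nstream\n" % (stream_num, len(data))
        + data
        + b"\nendstream\nendobj\n"
    )

    # File specification: names the file, points at the stream via /EF, and
    # states how the file relates to the document via /AFRelationship.
    spec_obj = (
        b"%d 0 obj\n<< /Type /Filespec /F (%s) /UF (%s) "
        b"/AFRelationship /Data /EF << /F %d 0 R >> >>\nendobj\n"
        % (spec_num, name, name, stream_num)
    )
    return stream_obj, spec_obj


if __name__ == "__main__":
    # Hypothetical payload and object numbers, for demonstration only.
    for obj in embedded_json_objects("model.json", {"part": "bracket"}, 10, 11):
        print(obj.decode("utf-8"))
```

In practice a PDF library would normally write these objects for you; the point of the sketch is simply that the data travels byte-for-byte inside the file and can be read back out unchanged.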

Disruption is always fun to think about, but perhaps the question should be: “are we using our existing technology to its fullest potential?” That’s the question General Patton would have asked.

You can have your data, and read it too.

Why is that PDF failing PDF/UA validation?

We’ve reported a bug in Microsoft Word’s online service, which is commonly used to create PDF files. It appears that an unembedded font is used for bullet and number symbols in all ordered and unordered lists, on both Windows and Mac. Because PDF/UA-1 requires fonts to be embedded, PDF/UA-1 validators will flag this as an issue. The problem can normally be resolved in an additional PDF/UA conversion step that corrects font errors. As of this writing, we do not know when this bug will be resolved in Word.

PDFacademicBot for December 2024

Abdumannon, B. and Qudratovich, S.S. (2024) ‘CONVERT IMAGES TO PDF IN C#’, Eurasian Journal of Mathematical Theory and Computer Sciences, 4(10), pp. 33–39. https://doi.org/10.5281/zenodo.13991413.

Allegrezza, S. (2024) ‘Addressing The Problem of File Formats Obsolescence: Italian Guidelines on File Format Conversion for the Long-Term Preservation of Electronic Records’, in iPRES 2024. iPRES 2024 Papers - International Conference on Digital Preservation, Ghent, Belgium. https://doi.org/10.21428/5676bf2d.cf6ac18d.

Amissah, E.K., Nduro, K. and Kudjordjie, E.D. (2024) ‘Optimizing Workflow Efficiency in Digital Printing Imposition: A Visual Tutorial on Design Impact Printing Methods’, International Journal For Multidisciplinary Research, 6(2), p. 13309. https://doi.org/10.36948/ijfmr.2024.v06i02.13309.

Infante, A. et al. (2025) ‘Would ChatPDF be advantageous for expediting the interpretation of imaging and clinical articles in PDF format?’, Journal of Medical Artificial Intelligence, 8(0). https://doi.org/10.21037/jmai-24-153.

Keyal, P.K. et al. (2024) ‘Efficient Information Digest: Exploring AI-Driven PDF Summarization through Streamlit, Open API, and LangChain’, in 2024 IEEE International Conference on Communication, Computing and Signal Processing (IICCCS). India: IEEE, pp. 1–5. https://doi.org/10.1109/IICCCS61609.2024.10763576.

Li XU et al. (2024) ‘Research on 3D Geospatial PDF Map Data Organization Model’, p. 16. [in Chinese] https://doi.org/10.13203/j.whugis20240227.

MAC, A.J. and Ragel, R.G. (2025) Document Image Analysis: Table Detection, Analysis And Format Preservation. BrownWalker Press. https://books.google.com.au/books?hl=en&lr=lang_en&id=UsouEQAAQBAJ

Quayson, R.N.K. (2024) THE IMPACT OF MULTIMODAL INPUTS ON USER ADOPTION IN STUDY PLANNING APPLICATIONS. B.Sc. Computer Science. Ashesi University. https://air.ashesi.edu.gh/server/api/core/bitstreams/2e1d58bc-12d2-4a1b-8b24-e29af1dd5ba3/content.

Sheth, R.K. and Parekha, C.D. (2025) ‘The Silent Threat: Safeguarding Against PDF-Based Malware With Intelligent Detection’, in Advanced Cyber Security Techniques for Data, Blockchain, IoT, and Network Protection. IGI Global Scientific Publishing, pp. 245–270. https://doi.org/10.4018/979-8-3693-9225-6.ch010.

Shigarov, A.O. (26 November 2024) ‘Table recognition in untagged PDF documents using PDF-specific features’, Journal of Computational Technologies, 6(29), pp. 125–146. [in Russian] https://doi.org/10.25743/ICT.2024.29.6.008.

Uematsu, T. et al. (November 2024) ‘A New Breast Density Assessment Method Using Portable Document Format’, Asian Pacific Journal of Cancer Prevention, 25(11), pp. 3947–3951. https://doi.org/10.31557/APJCP.2024.25.11.3947.

Weaver, C.A. (August 2024) ‘MCNP Output File Conversion (Los Alamos National Laboratory)’. 2024 MCNP User Symposium, Los Alamos, New Mexico, USA. https://www.osti.gov/servlets/purl/2440183
