Check your PDFs before you ship ‘em!
PDF in the Wild, February 24, 2026
About PDF Association staff
The world is probably getting tired of hearing about users’ redaction mistakes, but covering content with black boxes is not the only way to get into trouble!
What possible explanation is there for sharing PDF documents without checking for comments, as in this case? We don’t want to say, but…
Anyone responsible for checking a PDF before it goes out must ensure they use a PDF viewer that understands basic PDF features, such as annotations (which, by the way, quite a few mobile device readers do NOT).
A PDF icon doesn’t mean that the link goes to a PDF
Once again, the ubiquity of PDF documents, and the trust users place in them, are being abused to trick users into dangerous situations, even when no PDF is involved!
The emails in this phishing campaign don’t attach a document directly. Instead, they link to a file hosted on IPFS (InterPlanetary File System), a decentralized storage network increasingly used by cybercriminals because it can be accessed through normal web gateways. Those files are virtual hard disks that, when opened, mount as a local disk, bypassing some Windows security features. Inside the disk is a Windows Script File (WSF) purporting to be the expected PDF. When the user opens it, Windows executes the code in the file, leaving the computer open to exploitation by remote users.
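As a rough illustration of why the icon and the file name prove nothing, the sketch below compares what a downloaded file's name claims with what its first bytes contain. A PDF file begins with a `%PDF-` header. The 1024-byte search window is an assumption modelled on the leniency of some viewers, and the function names and sample file names are hypothetical. A matching header does not mean a file is safe or even a valid PDF; it only catches the crude mismatch described above.

```python
PDF_MAGIC = b"%PDF-"
# Assumption: lenient viewers accept the header anywhere in the first 1024 bytes.
HEADER_WINDOW = 1024


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    return PDF_MAGIC in data[:HEADER_WINDOW]


def describe_download(filename: str, data: bytes) -> str:
    """Compare what the file name suggests with what the bytes contain.

    Returns "pdf", "mismatch" (name mentions .pdf, content is not PDF),
    or "not-pdf".
    """
    name_mentions_pdf = ".pdf" in filename.lower()
    if looks_like_pdf(data):
        return "pdf"
    if name_mentions_pdf:
        return "mismatch"
    return "not-pdf"


# Hypothetical example: a script file dressed up with a PDF-like name.
print(describe_download("invoice.pdf.wsf", b"<job><script>...</script></job>"))
```

The example prints `mismatch`, because the name mentions `.pdf` while the content starts with script markup rather than a PDF header.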
Register for PDF Week London
Joining us for PDF Week London coming up in May? We’ve posted some suggested hotels within a short walk of our meeting venue.
Save the date for ISO Week in South Korea
Thanks to our colleagues at Hancom, PDF Week for late 2026 will be held in Incheon during the week of October 12-16, 2026.
Beyond our regular meetings, members of the PDF Association and ISO committees meeting in Korea will offer a half-day seminar open to the public to provide information on PDF’s current status and future direction.
LLM text extraction paradox
This month’s PDFacademicBot includes a new paper, “Speed, Simplicity, and Fidelity: A Multi-Metric Benchmark of Python PDF Extraction Libraries for RAG Pipelines,” by A. Subramanian. One finding is especially interesting for anyone extracting text from PDFs for LLMs:
We uncover an extraction paradox: tools specifically designed for LLM consumption (pymupdf4llm, Docling) significantly underperform simpler rule-based extractors (PyMuPDF, pypdfium2, pypdf) on text fidelity, while being 100–1,600× slower.
Share tokens instead of documents?
We’re hearing a lot of ideas about enhancing PDF files to make them easier for AIs to understand, and that’s great. We’re confident, however, that exchanging tokens instead of documents won’t go far.
Replace PDF with… Markdown? 😆😂😭
This article (in Dutch) has received attention on LinkedIn, where it was originally posted. For an author who self-describes as a “Technology Philosopher”, it shows an incredibly naive understanding of the very real challenges of complex typography, the diverse requirements of human communication, and the realities of real-world document workflows, including requirements for archival content.
As we’ve previously reported, long-term understanding of even “plain text” requires preservation of its context. One only has to look at EBCDIC, GB 18030, and the constant evolution of Unicode to understand that human communication goes far beyond “simple text”. Archivists, experienced software developers, and even comedians understand the importance of typefaces and appreciate the real-world challenges of dealing with legacy schema-less XML, undocumented JSON, or HTML files from the browser wars era.
In collaboration with ISO TC 171 SC 2 WG 10, our DocRM Liaison Working Group is helping develop the ISO 20271-1 reference model to address misunderstandings and misconceptions about textual preservation across all file formats.
Epstein PDFs analysis continues
Thanks to our coverage of the Epstein PDFs, reporters at The Verge asked for our thoughts on the oddities they and others are finding in emails converted to PDFs. Read more at theverge.com.
This article summarises what redaction is and lists some well-known redaction failures.
Redacting content versus hiding content, explained
If you are trying to remove content from a PDF, it’s essential that you understand the difference between “hiding” and “removing” content. The PDF Association’s new video explains the distinction.
Chrome now supports JPEG-XL
Further to our past announcements about the selection of JPEG-XL as the preferred HDR format for future PDF, and about Chrome engineers reversing their previous decision against JPEG-XL, Chrome v145 now includes support for the format. Try opening this JPEG-XL test page to see it in action.
The re-evaluation began in November 2025, when the Chromium team announced its resumption. Several factors were decisive: Apple had implemented JPEG XL support in Safari, Mozilla had abandoned its neutral stance, and the PDF Association had included the format in PDF specifications as recommended in October 2025. Technically, Chromium plans to integrate “jxl-rs,” a Rust-based JPEG XL decoder. Google is already using the format in practice: the Google Cloud Platform DICOM API uses JPEG XL to reduce file size by 20 percent.
Super Mario 64 in a PDF
Maybe you’ve played Doom in a PDF. Or taken a cue from Michael Demay, and ditched the PS5 for PDF 2.0.
Not everyone loves a first-person shooter. Perhaps you are more of a Mario fan? Now you can play Super Mario 64 on your favorite substrate: PDF!
Brotli gains media attention
The PDF Technical Working Group’s work on promoting Brotli as a new general compression algorithm was picked up by various news outlets in Germany and the US. Is your PDF technology ready for this “breaking change” that steps up PDF file size reductions?
PDFacademicBot for February 2026
The PDFacademicBot brings academic research on PDF and related technologies to the industry’s attention.
Açıkgöz, Z., Arslan, S., and Arslan, R.S. (Nov. 2025) “Enhancing File Security with an Optimized Auto-Classification Framework Based on Learning Models,” in 2025 9th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), pp. 1–6. https://doi.org/10.1109/ISMSIT67332.2025.11268095.
Bouwel, J.V. and Kock, T.D. (Jan. 2026) “FROM PDF TO PRESERVATION: USING AUTOMATED TEXT ANALYSIS TO UNCOVER STONE USE MOTIVATIONS IN 19TH CENTURY PUBLIC ARCHITECTURE,” in 15th International Congress on the Deterioration and Conservation of Stone, pp. 600–602. https://repository.uantwerpen.be/docman/irua/1c215cmotoM3f.
Hanson, M.D. (2026) “Confronting the Urgent Challenge of Using LaTeX to Create Accessible Course Materials,” ChemRxiv, 2026(0129). https://doi.org/10.26434/chemrxiv.10001708/v1.
Jain, R. and Kumar, S.R. (Jan. 2026) “Perfecting Tax Returns Like Code: A Verifier-Swarm, Codebase-Style Architecture that Solves TaxCalcBench,” p. 3. https://prime-meridian-papers.s3.us-west-2.amazonaws.com/solving_taxes_like_code.pdf
Kuligin, L. (Jan. 2026) “Layout-Aware Text Extraction Using Heuristic Segmentation and LLM-Based Refinement.” Technical Disclosure Commons - Defensive Publication Series. https://www.tdcommons.org/cgi/viewcontent.cgi?article=10508&context=dpubs_series.
Lam, D., Li, L. and Gabrielson, A. (Jan. 2026) “Parser Weakness Enumeration,” p. 8. https://drive.usercontent.google.com/download?id=1VUPYR9yTvnQgiSpj3CrMrLKbnZdWP9xu&export=download&authuser=0
Prakash, P. et al. (Nov. 2025) “Revolutionizing PDF Q&A with Local LLMs and Privacy-Enhanced Retrieval-Augmented Generation,” in 2025 International Conference on Green Energy, Computing and Sustainable Technology (GECOST), pp. 1–6. https://doi.org/10.1109/GECOST66002.2025.11324623.
Rigal, B. et al. (Feb. 2026) “Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion.” arXiv. https://doi.org/10.48550/arXiv.2602.11960.
Sharmila S. P (Jan. 2026) “PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection.” arXiv. https://doi.org/10.48550/arXiv.2601.12866.
Silaen, C.J. et al. (Nov. 2025) “Automatic Generation of Presentation Slides from PDF Using Retrieval-Augmented Chatbot,” in 2025 IEEE 11th International Conference on Computing, Engineering and Design (ICCED), pp. 1–6. https://doi.org/10.1109/ICCED68324.2025.11324852.
Subramanian, A. (Feb. 2026) “Speed, Simplicity, and Fidelity: A Multi-Metric Benchmark of Python PDF Extraction Libraries for RAG Pipelines.” https://doi.org/10.13140/RG.2.2.29289.56168.
Vasepalli, K. et al. (2025) “Intelligent Model for PDF Malware Detection,” in Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies. Nagapattinam, India: SCITEPRESS - Science and Technology Publications, pp. 800–806. https://doi.org/10.5220/0013943800004919.
Wallwater, I. et al. (Jan. 2026) “ChemSIE: From Document Based Records to Machine Actionable Experimental Data,” ChemRxiv, 2026(0121). https://doi.org/10.26434/chemrxiv.10001481/v1.
Waseem, A., Zia, M.A.M. and Adedayo, O.M. (Jan. 2026) “A Comparative Study of Forensic File Type Identification Methods for Tool Type Identification,” IEEE Open Access, p. 14. https://doi.org/10.1109/ACCESS.2026.3655461.
Zhu, W., Mazeen Mujthaba, M., and Wong, K. (Jan. 2026) “Reversible data hiding in PDF files by overlapping characters,” Journal of Information Security and Applications, 97, p. 104375. https://doi.org/10.1016/j.jisa.2026.104375.


