PDF/A & PDF/UA, together at last!

About PDF Association staff

The latest veraPDF release candidate adds PDF/UA-2 support. LibreOffice creates PDF/A and PDF/UA together. A “white paper” gets PDF wrong, again. Visualizing Warichū.

PDF in the Wild, March 14, 2024

veraPDF release candidate adds PDF/UA-2 draft support

The Open Preservation Foundation is pleased to announce that the veraPDF 1.26 release candidate is out now! veraPDF is an open-source, industry-supported PDF/A validator and part of the OPF reference toolset.

This release is made up of enhancements, technical maintenance, and bug fixes:

support added for software enforceable checks (see the PDF Association's Matterhorn Protocol) from the draft PDF/UA-2 specification;
the veraPDF REST API has been improved and better documented;
improvements to the REST Docker image;
improved CLI output (help, debug logs, text report, JSON report formatting);
improvements to the veraPDF GUI;
various bug fixes and improvements.

Learn more about the new fixes and features, read the release notes and get the release candidate (ZIP).

Latest LibreOffice supports both PDF/UA-1 and PDF/A in a single export

The latest LibreOffice 24.2 release provides new options when exporting documents - now exported PDF files can be created in conformance with both PDF/A and PDF/UA-1!

Our quibbles? It’s too bad that LibreOffice didn’t also choose to support PDF/A’s conformance level “a”, which it certainly could have done given PDF/UA. Additionally, when embedding the ODF file (see the “Hybrid PDF” option), the software should force the “PDF/A version” to be limited to PDF/A-3 - at present the software is perfectly happy to embed the ODF in a PDF/A-2 file, which is invalid. We have reported this to LibreOffice and hope they will issue a fix soon.

Although exporting dual conformant PDF/A and PDF/UA files is a great new capability, software is always a work in progress.
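The embedding constraint described above can be sketched as a small validation rule. This is a minimal illustration, not LibreOffice's code: the names `PdfAPart`, `allowed_pdfa_parts` and `is_valid_export` are hypothetical, and the sketch only encodes the assumption that PDF/A-1 forbids embedded files, PDF/A-2 permits only embedded files that are themselves PDF/A, and PDF/A-3 permits arbitrary formats.

```python
from enum import IntEnum


class PdfAPart(IntEnum):
    """PDF/A parts considered in this sketch (hypothetical helper type)."""
    A1 = 1
    A2 = 2
    A3 = 3


def allowed_pdfa_parts(embeds_odf_source: bool) -> list[PdfAPart]:
    """Return the PDF/A parts an exporter could safely offer.

    Assumption: an embedded ODF source file is not itself PDF/A, so it is
    only acceptable in PDF/A-3, which allows arbitrary embedded formats.
    """
    if embeds_odf_source:
        return [PdfAPart.A3]
    return [PdfAPart.A1, PdfAPart.A2, PdfAPart.A3]


def is_valid_export(part: PdfAPart, embeds_odf_source: bool) -> bool:
    """Check a requested combination before writing the file."""
    return part in allowed_pdfa_parts(embeds_odf_source)
```

With a check like this in the export dialog, choosing the hybrid option would simply restrict the version selector to PDF/A-3 instead of producing an invalid PDF/A-2 file.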

Another “white paper” gets PDF wrong

Wiley / Atypon has just posted a “white paper” (registration required), the text of which is viewable on NISO’s site. This document makes several unfortunate claims and reads like a sales pitch. Let’s have a look…

First, the paper references WCAG and EPUB accessibility standards but fails to mention PDF/UA, an ISO standard (ISO 14289). There’s not much excuse for this, as PDF/UA is broadly accepted, explicitly referenced in US federal regulations, implied by many accessibility regulations, and widely implemented by software vendors.

The paper goes on to state: “Because while newer options can improve PDF accessibility, a reflowable ePub will always be more accessible than a PDF.” This is misleading.

First, a significant fraction of content CANNOT be properly represented in reflowable EPUB files, in particular content that requires fixed relationships (commonplace in technical documents) and content that spans pages (as tables and diagrams often do).

Second, PDF can be just as accessible as reflowable ePub if it includes PDF’s accessibility features. Software that understands PDF/UA-1 or well-tagged PDF can extract the file’s contents for representation in a reflowable format such as HTML. This use case is actually the purpose of the Deriving HTML from PDF specification.
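To make the idea concrete, here is a toy sketch of how a tagged structure tree could be turned into reflowable HTML. It is not the algorithm from the Deriving HTML from PDF specification: the `Elem` class, the `TAG_TO_HTML` table and the `to_html` function are hypothetical, the mapping covers only a handful of structure types, and a real implementation would read the structure tree from the PDF file itself.

```python
from dataclasses import dataclass, field
from html import escape

# Illustrative subset only: an assumed mapping from a few standard
# structure types to the HTML elements they naturally correspond to.
TAG_TO_HTML = {
    "H1": "h1", "H2": "h2", "H3": "h3",
    "P": "p",
    "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
}


@dataclass
class Elem:
    """A simplified structure element: a tag, optional text, and children."""
    tag: str
    text: str = ""
    kids: list["Elem"] = field(default_factory=list)


def to_html(node: Elem) -> str:
    """Serialize a structure element and its children as HTML.

    Tags missing from the table fall back to a generic <div>.
    """
    name = TAG_TO_HTML.get(node.tag, "div")
    inner = escape(node.text) + "".join(to_html(kid) for kid in node.kids)
    return f"<{name}>{inner}</{name}>"


doc = Elem("Document", kids=[Elem("H1", "Results"), Elem("P", "Fixed & reflowable.")])
print(to_html(doc))
```

The point of the sketch is that the reflowable view is only as good as the tags: if headings, lists and tables are tagged correctly, the HTML comes out structured correctly too.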

We note that Wiley / Atypon cared enough to make their white paper available as a PDF file… and we also note that, contra their recommendation, they didn’t bother to also post it as an EPUB. It's unfortunate, however, that the tagging of their PDF file is really bad, making the file inaccessible to users who require assistive technology (AT) in order to read. Among its sins…

The role mapping is fragmentary - some headings are mapped correctly, others aren’t. Accordingly, AT users will only be able to find some of the headings.
Footnotes are tagged as a single block at the bottom of the page - there’s no way for AT users to navigate to footnotes and back to the text.
Lists are indicated but entirely untagged within the <L> tag. AT users will not be able to tell where one list-item <LI> ends and the next begins.
It’s wrong to map the document’s <Title> tag to <H1>… but then tag subsequent headings as <H1> as well! AT users will be confused as to the structure of the document.
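Two of the problems listed above are easy to detect mechanically. The following is a minimal sketch over a simplified, in-memory structure tree, not a real PDF parser and not a PDF/UA validator: `Node`, `resolve_role` and `audit` are hypothetical names, the role map is modeled as a plain dictionary, and the findings are heuristics meant to prompt a human review (several H1 elements are not invalid in themselves).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified structure element: a tag plus child elements."""
    tag: str
    kids: list["Node"] = field(default_factory=list)


def resolve_role(tag: str, role_map: dict[str, str]) -> str:
    """Follow role-map entries until a tag has no further mapping."""
    seen = set()
    while tag in role_map and tag not in seen:  # 'seen' guards against cycles
        seen.add(tag)
        tag = role_map[tag]
    return tag


def walk(node: Node):
    """Yield every element in document order."""
    yield node
    for kid in node.kids:
        yield from walk(kid)


def audit(root: Node, role_map: dict[str, str]) -> list[str]:
    """Return human-readable findings for two common tagging problems."""
    findings = []
    resolved = [(n, resolve_role(n.tag, role_map)) for n in walk(root)]

    # Heuristic: a title mapped to H1 alongside other H1 headings.
    h1_count = sum(1 for _, role in resolved if role == "H1")
    title_is_h1 = any(n.tag == "Title" and role == "H1" for n, role in resolved)
    if title_is_h1 and h1_count > 1:
        findings.append(f"Title is mapped to H1 and {h1_count} elements resolve to H1")

    # A list whose children never resolve to LI gives AT no item boundaries.
    for node, role in resolved:
        if role == "L" and not any(
            resolve_role(kid.tag, role_map) == "LI" for kid in node.kids
        ):
            findings.append("list (L) without any LI children")
    return findings
```

For example, a tree with a `Title` and a custom `Heading1` both mapped to `H1`, plus an `L` containing only a `P`, would produce two findings, while a tree with one `H1` and a properly itemized list would produce none.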

The authors should consider adding PDF/UA to their awareness… and using PDF/UA-aware tools to ensure that their PDF files are, in fact, accessible, as this white paper certainly isn't. This is especially true when the document’s subject is accessibility, as in this case!

Visualizing Warichū

Warichū and Ruby elements are Asian language typesetting features that are often poorly understood by non-native speakers as they don’t really have an equivalent in Latin-based typography. If you are unfamiliar with Warichū, here's an excellent explainer.

Support for Warichū and Ruby in Tagged PDF was added to PDF 1.5 in 2003. You can find the latest information on PDF support for Warichū and Ruby in subclause 14.8.4.7.3 of the ISO 32000-2 (PDF 2.0) specification (available at no cost). WTPDF and PDF/UA-2 provide guidance on the use of these structure elements for both reuse and accessibility.

GWG to present latest PDF information at graphic arts conference

The Ghent Workgroup will present on PDF-related topics at the 9th Conference of Information and Graphic Arts Technology to be held at the Faculty of Natural Sciences and Engineering in Ljubljana, Slovenia from 11 to 12 April 2024:

The State and Future of Print technology and processes and the Role of the GWG in the Print Community, Thursday 11 April 2024 at 9.15 am – David Zwang (Chair GWG)
Preflight and its role in print production, Thursday 11 April 2024 at 10 am – Christian Blaise (Marketing Officer GWG)
PDF2.0 and PDF/X-6, Friday 12 April 2024 at 9 am – Freddy Pieters (member GWG)

The two-day scientific conference will provide a unique opportunity to bring together researchers and experts from different fields of graphic communications. Before the meeting and conference, Ghent Workgroup will also present a forum for the students of Ljubljana University, where the future of PDF will be discussed.

Our PDF extension for VSCode gets “featured”

Thank you to everyone who has downloaded our new PDF COS Syntax extension for Visual Studio Code. It’s now a featured extension in the Visual Studio Code marketplace. Be sure to leave a review!

Screenshot of the VSCode marketplace showing PDF COS Syntax as Featured.

PDFacademicBot for March, 2024

Al-Fayoumi, M. et al. (2024) ‘XAI-PDF: A Robust Framework for Malicious PDF Detection Leveraging SHAP-Based Feature Engineering’, The International Arab Journal of Information Technology, 21. https://doi.org/10.34028/iajit/21/1/12.

Aniket Holkar et al. (2024) ‘Unlocking the depth analysis of PDF using artificial intelligence, large language model, langchain’, International Research Journal of Modernization in Engineering Technology and Science, 6(2), p. 3. https://www.doi.org/10.56726/IRJMETS49113.

Anantharaman, P. et al. (May 2023) ‘PolyDoc: Surveying PDF Files from the PolySwarm network’, in 2023 IEEE Security and Privacy Workshops (SPW), IEEE Computer Society, pp. 117–134. https://doi.org/10.1109/SPW59333.2023.00017 and https://langsec.org/spw23/papers.html#polydoc.

Benko, Samuel (May 2023) A tool for checking texts extracted from PDF. Bachelor Thesis, Faculty of Informatics. Masaryk University. https://is.muni.cz/th/e233g/Bakalarska_praca_Tool_for_text_extraction-final_Archive.pdf.

Hughes, K. and Black, M. (2024) ‘A Content-Based Approach to Data Carving Portable Document Format (Pdf) Files’. Rochester, NY. https://doi.org/10.2139/ssrn.4711141.

Konstantin Nikolaev (2023) ‘Web-based visualization of semantic annotation of mathematical PDF documents’, pp. 10. https://damdid2023.hse.ru/mirror/pubs/share/867944227.pdf.

Lanasri, D. (2023) ‘Patterns Recognition Approach for Newspapers Analytics: Case of Algerian PDF Newspapers’, in 12th Seminary of Computer Science Research at Feminine, Constantine, Algeria: CEUR Workshop, p. 12. Available at: https://ceur-ws.org/Vol-3616/paper7.pdf.

Mittelbach, F. and Fischer, U. (2024) ‘Enhancing LaTeX to Automatically Produce Tagged and Accessible PDF’, in The 5th International Workshop on “Digitization and E-Inclusion in Mathematics and Science 2024”, Tokyo, Japan, p. 7. https://www.researchgate.net/publication/378521418_Enhancing_LaTeX_to_Automatically_Produce_Tagged_and_Accessible_PDF.

Riwanda, A., Ridha, M. and Islamy, M.I. (2024) ‘Empowering Asynchronous Arabic Language Learning Through PDF Hyperlink Media’, The International Review of Research in Open and Distributed Learning, 25(1), pp. 66–88. https://doi.org/10.19173/irrodl.v25i1.7425.

Sharma, A. and Babbar, H. (2023) ‘The State of Cybersecurity in Digital Landscape to Unmask PDF Malware Attacks’, in 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), pp. 1–6. https://doi.org/10.1109/SMARTGENCON60755.2023.10442942.
