How the inclusion of explicit information will empower the PDF standard
PDF has been around for nearly 30 years, and its ability to ensure a faithful representation of graphic content has helped make it a powerful ISO standard. Its comprehensive specification of fonts, device-independent color spaces, transparency, and numerous prepress features guarantees faithful reproduction of authored content. Coupled with support for compression, encryption, and external files, and with ISO subset standards such as PDF/X, PDF/A, and PDF/E, PDF is a powerful tool for capturing documents with perfect visual fidelity. However, the introduction of PDF/UA in 2012 exposed a serious shortcoming in how PDFs are authored: authoring applications often include very little explicit information, however graphically rich their output. Much information is routinely lost in a PDF: font encoding information, structural information, table data, mathematical formulae, original resources, and layout schemas, to name just a few. Even today's best AI solutions often struggle to synthesize document information from implicit clues. This talk discusses the hidden gems in the PDF format to show how authoring applications should create semantically rich PDFs containing explicit information that allows for better readability, repurposing, accessibility, information extraction, and document understanding. Examples from current authoring applications will show exemplary creation of valuable PDFs with explicit information that will extend the PDF standard into the next 25 years.

Shawn Gaither has been working on PDF structure for 25 years at Adobe and has led efforts in OCR, eBooks, accessibility, document comparison, form field detection, and, most recently, structure detection for Liquid Mode on Acrobat mobile.
Slides download: https://pdfa.org/wp-content/uploads/2022/04/1530-Gaither.pdf