NARA updates its guidance on OCR and PDF collections
New guidance on PDF Portable Collections and on digitizing paper. | US Army website features PDF/UA | PDFism! | Scratch-and-sniff PDF? | NARA’s new blog | Bit List says PDF is “vulnerable” | PDF/UA as bragging rights! | PDFacademicBot for November 2024.
New guidance for US federal agencies
The US National Archives and Records Administration (NARA) has posted improved guidance on the use of OCR, along with new advice for agencies considering PDF’s Portable Collections (aka “PDF Portfolios”), a feature that lets a single PDF package related content in different file formats.
The PDF Association provided NARA with expert review of this content to ensure vendor-neutrality and technical accuracy.
A US Army website features PDF/UA files
The US Army has launched a new website that gives soldiers a single access point to the service’s branch journals.
Line of Departure is a new user-friendly site that provides audio files, military podcasts and other resources.
“Our mission is to provide readers with a singular access point to engage with the wealth of knowledge and insights published across all Army branch journals at any time,” said Col. Todd Schmidt, director of Army University Press.
All articles available on Line of Departure are marked as PDF/UA compliant. The site also offers mobile-optimized HTML content and audio versions.
“Have you heard the good news about PDFism?”
From File types hanging out, a short video by Elle Cordova.
Should future PDF support aromas?
We've verified that the French Post Office has released a scratch-and-sniff postage stamp to celebrate the baguette.
The fragrant stamps depict a baguette wrapped in a blue, white and red ribbon, with the text “La baguette, de pain française” (“The baguette, the French bread”). Released on the Feast Day of St. Honoré, the patron saint of bakers and pastry chefs, almost 600,000 stamps were printed, each with the unmistakable aroma of freshly baked bread. “The baguette, the bread of our daily lives, the symbol of our gastronomy, the jewel of our culture”, said La Poste on its website.
We wonder if this stamp was printed via PDF or PDF/X - maybe using ISO 19593-1:2018 Graphic technology — Use of PDF to associate processing steps and content data — Part 1: Processing steps for packaging and labels?
NARA launches new “Fixity Check” blog
The U.S. National Archives and Records Administration has announced a new blog: Fixity Check. Fixity Check will highlight digital preservation efforts at NARA and offer a glimpse into the broad scope of digital preservation efforts.
The first Fixity Check blog post highlights NARA’s recent re-release of the Digital Preservation Framework: “Digital Preservation Framework Re-Release Part 1: What is the Digital Preservation Framework?”
“Bit List” of endangered formats classifies PDF as "vulnerable"
The Digital Preservation Coalition has released its interim “Bit List” of Digitally Endangered Species 2024 report. Page 171 lists PDF (including PDF/A) as “Vulnerable”, which is described as:
“Digital materials are listed as Vulnerable when the technical challenges to preservation are modest but responsibility for care is poorly understood, or where the responsible agencies are not meeting preservation needs.” (Appendix 1: Risk Classifications).
This is primarily because of poor practices and a lack of understanding, as the report explicitly notes:
“Due to the level of commercial, open-source tools that are available to assist preservation, the risk of loss was less persistent than previously suggested. Therefore, a Vulnerable classification was assigned as the most appropriate for all PDF formats as a whole. …
There is a lot of material produced and kept in PDF. Some of it is authoritative, in other words, the only available copy, while some of it is not. However, if it is the only copy and it is lost, it can have an impact on a lot of people.
The challenge in evaluating the significance and impact of the loss of PDFs is that they’re quite often a surrogate of something else, whether a digitized record or a Word document, etc. Whether or not that record is retained may be a factor. We should also be considering PDF Portfolios, which are an extension of PDF 1.7. Portfolios contain embedded files and can include text documents, spreadsheets, PowerPoints, emails, Computer Aided Design (CAD) drawings.
Vulnerability also depends on if the PDF file conforms to the specific PDF/A standard or not. This is caused by a combination of 1) not conforming to the standard and 2) collection managers assuming that the file is resilient simply because it purports to be a PDF/A. This risk is less with the format and more with the understanding and experience in data management. Moreover, materials embedded in or attached to PDF/A-2 and PDF/A-3 may be at risk.”
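The report’s point about files that merely purport to be PDF/A can be illustrated with a short sketch. A PDF/A file declares its flavour through the `pdfaid:part` and `pdfaid:conformance` properties in its XMP metadata, and reading that declaration is trivial. The Python below is a minimal illustration, not a validator: the function name `claimed_pdfa` is hypothetical, and it assumes the XMP packet is stored uncompressed so that a plain byte search can find it.

```python
import re

# XMP properties a PDF/A file uses to declare its part and conformance level.
# They may be written as elements (<pdfaid:part>2</pdfaid:part>)
# or as attributes (pdfaid:part="2"), so both forms are accepted.
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa(data: bytes):
    """Return the PDF/A flavour a file claims (e.g. 'PDF/A-2b'), or None.

    Hypothetical helper: it reads the claim only and validates nothing.
    Assumes the XMP metadata is present in the file as uncompressed text.
    """
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```

Calling `claimed_pdfa(open(path, "rb").read())` on a file reports only what the file says about itself. Whether the file actually conforms is a separate question that needs a real PDF/A validator such as veraPDF, which is the gap the report warns collection managers about.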
The PDF Association continues to actively engage with the preservation community to provide informational and educational resources, such as the recent white paper on “Understanding Private Data in PDF/A”.
PDF/UA as bragging rights!
A clothing company publishes its first Sustainability Report… and is proud to note that the report conforms to PDF/UA!
We’re looking for a local Carrera retailer now…
PDFacademicBot for November 2024
Barpute, J.V. et al. (August 2024) ‘PDF Fusion: Revolutionizing Document Management with Chat Interfaces and Voice Recognition’, in 2024 5th International Conference on Electronics and Sustainable Communication Systems (ICESC), pp. 1774–1783. https://doi.org/10.1109/ICESC60852.2024.10689987.
Challagundla, B.C. et al. (October 2024) ‘End-to-End Neural Embedding Pipeline for Large-Scale PDF Document Retrieval Using Distributed FAISS and Sentence Transformer Models’, Preprint. http://dx.doi.org/10.13140/RG.2.2.12625.34409.
Durmaz, A.R. et al. (2024) ‘An ontology-based text mining dataset for extraction of process-structure-property entities’, Scientific Data, 11(1), p. 1112. https://doi.org/10.1038/s41597-024-03926-5.
Haystack EU 2024 - Jo Kristian Bergum: What You See Is What You Search: Vision Language Models & PDFs (October 2024). YouTube video (45:09): https://www.youtube.com/watch?v=lURz8Tf3uQk.
Jaiswal, R. (2024) Publishing for all: Using LaTeX to help improve the accessibility of an open-access journal. Master of Science in Human-Computer Interaction. Rochester Institute of Technology. https://www.jptweb.com/wp-content/uploads/2024/10/RahulJaiswalCapstone.pdf.
Kumar, T. et al. (October 2024) ‘Navigating Complexity: A Tailored Question-Answering Approach for PDFs in Finance, Bio-Medicine, and Science’. Preprints. https://doi.org/10.20944/preprints202410.1395.v1.
Li, J. et al. (October 2024) ‘Research on PDF splitting and standardized renaming based on information intelligent extraction’, in Fourth International Conference on Computer Graphics, Image, and Virtualization (ICCGIV 2024), SPIE, pp. 64–70. https://doi.org/10.1117/12.3044881.
Meng, N. and Zhou, X. (2024) ‘A Visualization Method of Large‐Size Vector Published Map’. Preprints. https://doi.org/10.20944/preprints202410.1282.v1.
Mittelbach, F. and Fischer, U. (2024) ‘LaTeX Tagged PDF project progress report for summer 2024’, TUGboat, 45(2), pp. 237–239. https://doi.org/10.47397/tb/45-2/tb140mittelbach-tagging24.
Neustaedter, E.R. and Wolfe, E.R. (2024) Geospatial PDF map of the compilation of GIS data for the mineral industries of select countries in the Indo-Pacific region, Open-File Report. 2024–1066. U.S. Geological Survey. https://doi.org/10.3133/ofr20241066.
Patil, V. et al. (October 2024) ‘Integrating AI-Powered Chatbots and PDF Processing for Enhanced Document Management: A Full-Stack AI SaaS Approach’, Library Progress International, 44(3), pp. 9982–9992. https://www.researchgate.net/publication/384812556_Integrating_AI-Powered_Chatbots_and_PDF_Processing_for_Enhanced_Document_Management_A_Full-Stack_AI_SaaS_Approach.
Shahid, M.A., Safyan, M. and Pervez, Z. (2024) ‘Improvising the Malware Detection Accuracy in Portable Document Format (PDFs) through Machine Learning Classifiers’, Review of Applied Management and Social Sciences, 7(4), pp. 201–221. https://doi.org/10.47067/ramss.v7i4.373.
Ting Tin Tin et al. (October 2024) ‘Interactive ChatBot for PDF Content Conversation Using an LLM Language Model: LLM-Based PDF ChatBot’, International Journal of Advanced Computer Science and Applications, 15(9), p. 8. https://doi.org/10.14569/IJACSA.2024.01509105.
Vargas Sepúlveda, M. et al. (2024) ‘tabulapdf: An R Package to Extract Tables from PDF Documents’, arXiv e-prints [Preprint]. https://doi.org/10.48550/arXiv.2409.14524.
Wang, A. (September 2024) Deep Learning Multimodal Extraction of Reaction Data. Thesis. Master of Engineering in Electrical Engineering and Computer Science. Massachusetts Institute of Technology. https://dspace.mit.edu/bitstream/handle/1721.1/157191/wang-wang7776-meng-eecs-2024-thesis.pdf?sequence=1&isAllowed=y.
Xie, X. et al. (October 2024) ‘PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling’. arXiv. https://doi.org/10.48550/arXiv.2410.05970.
Zagorodnikov, M.V. and Mikhaylov, A.A. (October 2024) ‘Recovering Text Layer from PDF Documents with Complex Background’, Proceedings of the Institute for System Programming of the RAS, 36(3), pp. 189–202. https://doi.org/10.15514/ISPRAS-2024-36(3)-13 (Russian language).