Is your company “first-class”?
What do “first class companies” do? Japanese media covers the PDF Association in Tokyo. Firefox adds experimental alt text generation to PDF. JPMorgan are thinking about layout and semantics in their LLMs. Gnome has a new PDF viewer. Updates from the Open Preservation Foundation. And of course, the PDFacademicBot for June 2024!
Contents
- Is your company “first-class”?
- What do "first class companies" do?
- Japanese media covers PDF Association event
- Firefox adds experimental facility to add alt text to PDFs
- JPMorgan describe semantic and layout aware LLM
- Gnome first release of new PDF viewer - “Papers”
- Updates from the Open Preservation Foundation
- PDFacademicBot for June 2024
What do "first class companies" do?
According to the Standardization Administration of the People’s Republic of China, Chinese policymakers are often quoted as saying:
"First-class companies do standards. Second-tier companies do technology. Third-tier companies do products."
(一流的企业做标准,二流的企业做技术,三流的企业做产品)
Irrespective of geopolitics, the level of importance that one of the world’s biggest economies attributes to standards resonates with the mission and vision of the PDF Association, and our drive to involve more stakeholders in the collaborative, vendor-neutral, open forums that manage, maintain and develop technical standards for PDF technology.
Beyond the benefits of networking with technical experts and seeing "into the future" of the format, active involvement with the PDF Association demonstrates an organization’s ongoing commitment to PDF technology, its forward-leaning, strategic thinking, and its industry leadership overall.
Japanese media covers PDF Association event
While in Tokyo for PDF Week in early May, the PDF Association’s Chair, Raf Hens, CEO Duff Johnson, and CTO Peter Wyatt were interviewed by Yuuki Yamadai for MyNavi’s Tech+, a leading Japanese business information site for users of information technology.
The conversation largely focussed on PDF in the context of AI. You can read the interview here (Japanese), and additional coverage here and here.
Firefox adds experimental facility to add alt text to PDFs
As reported in this Mozilla Hacks blog post:
“Firefox 130 will introduce an experimental new capability to automatically generate alt-text for images using a fully private on-device AI model. The feature will be available as part of Firefox’s built-in PDF editor, and our end goal is to make it available in general browsing for users with screen readers.”
This capability works entirely within Firefox and does not require sending images to a server.
“Starting in Firefox 130, we will automatically generate an alt text and let the user validate it. So every time an image is added, we get an array of pixels we pass to the ML engine and a few seconds after, we get a string corresponding to a description of this image (see the code).
The first time the user adds an image, they’ll have to wait a bit for downloading the model (which can take up to a few minutes depending on your connection) but the subsequent uses will be much faster since the model will be stored locally.
In the future, we want to be able to provide an alt text for any existing image in PDFs, except images which just contain text (it’s usually the case for PDFs containing scanned books).”
While Firefox is not the first PDF technology to apply AI to this use case, we’re excited to see increased awareness of PDF accessibility from a major browser team!
JPMorgan describe semantic and layout aware LLM
As our recent Japanese media coverage (above) illustrated, we’ve been critical of the many LLMs that operate only on simple streams of plain text (at best) or just N-grams of words, whether these come from text extraction or OCR. So we were excited to see this preprint article from JPMorgan, titled “DocLLM: A layout-aware generative language model for multimodal document understanding”, which adds far richer document understanding and document intelligence:
“Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers to a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.”
The researchers also describe and benchmark a new corpus called “BizDocs”, which they characterize as “a collection of business entity filings”. Unfortunately, neither the LLM technology nor the BizDocs corpus is public yet, so the results are hard to judge. But as the JPMorgan researchers state, “Document intelligence is inherently a multi-modal problem with both the text content and visual layout cues being critical to understanding the documents” - we couldn’t agree more!
Gnome first release of new PDF viewer - “Papers”
The Gnome desktop environment for Linux recently announced the first release of Papers, its replacement for the Evince PDF viewer. This first proper release has been published on Flathub (the Linux app store) as the project looks for wider user and community feedback. Papers is a fork of Evince and uses Poppler beneath the covers for PDF processing. Reportedly, it has been ported from GTK3 to GTK4 and brings an improved user interface and other refinements.
Updates from the Open Preservation Foundation
The latest Summer 2024 e-newsletter from the Open Preservation Foundation (OPF) announces:
- Multiple software updates, including the second release candidate of veraPDF 1.26, which supports PDF/UA-2, the first bugfix release for Jpylyzer 2.2.1 (an open source validator and feature extractor for JPEG2000 images), and an updated release candidate for ViPER 1.2 (OPF’s virtual preservation container with all their software tools).
- The most recent blog post by Micky Lindlar (Leibniz Information Centre for Science and Technology), entitled ‘A date with PDF-HUL-133 “Improperly formed date”’, describes a common issue that many JHOVE users encounter with malformed PDF dates.
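For readers who want to see what a well-formed PDF date looks like, here is a minimal Python sketch. It assumes the date string layout D:YYYYMMDDHHmmSSOHH'mm as we understand it from ISO 32000, where only the "D:" prefix and the year are required and each later field may appear only if all preceding fields are present. The function name parse_pdf_date is hypothetical, and this is not how JHOVE implements its check; it is only an illustration of the kind of rule a validator might apply.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed layout: D:YYYYMMDDHHmmSSOHH'mm, with an optional trailing apostrophe
# (written by older PDF versions). Nesting enforces "a field may only appear
# if every preceding field is present".
PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})"
    r"(?:(?P<month>\d{2})"
    r"(?:(?P<day>\d{2})"
    r"(?:(?P<hour>\d{2})"
    r"(?:(?P<minute>\d{2})"
    r"(?:(?P<second>\d{2})"
    r"(?:(?P<rel>[Z+\-])(?:(?P<tzh>\d{2})'?(?:(?P<tzm>\d{2})'?)?)?)?"
    r")?)?)?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Return a datetime for a well-formed PDF date string, else None.

    Hypothetical helper for illustration only.
    """
    match = PDF_DATE.match(value)
    if match is None:
        return None
    f = match.groupdict()
    try:
        tz = None
        if f["rel"]:
            offset = timedelta(hours=int(f["tzh"] or 0), minutes=int(f["tzm"] or 0))
            if f["rel"] == "-":
                offset = -offset
            tz = timezone(offset)
        # Missing month and day default to 01; missing time fields default to 00.
        return datetime(
            int(f["year"]),
            int(f["month"] or 1),
            int(f["day"] or 1),
            int(f["hour"] or 0),
            int(f["minute"] or 0),
            int(f["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range values such as month 13 or an offset of 24 hours or more.
        return None
```

Under these assumptions, a string such as D:20240615120000+09'00' parses to a timezone-aware datetime, while an ISO 8601 string such as 2024-06-15T12:00:00 (no "D:" prefix) or an out-of-range value such as D:20241332 returns None.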
PDFacademicBot for June 2024
Ashlin Deepa R. N. et al. (April 2024) ‘An Intelligent Invoice Processing System Using Tesseract OCR’, in 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pp. 1–6. https://doi.org/10.1109/ADICS58448.2024.10533509.
Buakhao, R. (May 2024) Extracting Known Side Effects from Summaries of Product Characteristics (SmPCs) Provided in PDF Format by the European Medicines Agency (EMA) using BERT and Python. Masters of Computer Science. Uppsala University. https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-528716
Correia, J. et al. (2024) ‘Unveiling the Potential of a Conversational Agent in Developer Support: Insights from Mozilla’s PDF.js Project’, p. 9. https://doi.org/10.1145/3664646.3664758.
Chunduru, E.S.P. and Koppolu, N.R. (2021) ‘Digital Crime Investigation: Unleashing the Secrets of Data by Attributing the Origination and Authorship’, International Journal of Engineering Research, 9(13), p. 7. https://bit.ly/3Vfdfyl
Hossain, G.M.S., Deb, K. and Sarker, I.H. (2024) ‘An Enhanced Feature-Based Hybrid Approach for Adversarial PDF Malware Detection’, in 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), pp. 101–106. https://doi.org/10.1109/ICEEICT62016.2024.10534412.
Hughes, K. and Black, M. (2024) ‘An Alternative Approach to Data Carving Portable Document Format (PDF) Files’, Journal of Cybersecurity Education, Research and Practice, 2024(1). https://doi.org/10.62915/2472-2707.1188.
Hujaleh, H. and Whyte, J. (2024) ‘Narrowing the Lens: Preservation Assessment for Digital Manuscripts’, RBM: A Journal of Rare Books, Manuscripts, and Cultural Heritage, 25(1), pp. 39–49. https://doi.org/10.5860/rbm.25.1.39.
Musleh, D.A. (2024) ‘Rule-Based Information Extraction from Multi-format Resumes for Automated Classification’, Mathematical Modelling of Engineering Problems, 11(4), pp. 1044–1052. https://doi.org/10.18280/mmep.110422.
Prem Jacob, T., Bizotto, B.L.S. and Sathiyanarayanan, M. (2024) ‘Constructing the ChatGPT for PDF Files with Langchain – AI’, in 2024 International Conference on Inventive Computation Technologies (ICICT), pp. 835–839. https://doi.org/10.1109/ICICT60155.2024.10544643.
Saha, A., Blasco, J. and Lindorfer, M. (May 2024) ‘Exploring the Malicious Document Threat Landscape: Towards a Systematic Approach to Detection and Analysis’, p. 12. https://martina.lindorfer.in/files/papers/maldocs_worma24.pdf.
Siu, A. et al. (April 2024) ‘ATLAS: A System for PDF-centric Human Interaction Data Collection’, in K.-W. Chang, A. Lee, and N. Rajani (eds) Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations). Mexico City, Mexico: Association for Computational Linguistics, pp. 87–96. https://aclanthology.org/2024.naacl-demo.9
Stankovski, A. and Garijo, D. (2024) ‘RepoFromPaper: An Approach to Extract Software Code Implementations from Scientific Publications’, p. 14. https://dgarijo.com/papers/stankovsi_2024.pdf.
Tanaka, S. et al. (May 2024) ‘KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery’. arXiv. https://doi.org/10.48550/arXiv.2406.00008.
Sharma, V., Gupta, V., Tyagi, V., Abhinav and Shahid, M. (May 2024) ‘Advance PDF To Audio Converter’, Educational Administration: Theory and Practice, 30(5), pp. 10996–11001. https://doi.org/10.53555/kuey.v30i5.4876 https://kuey.net/index.php/kuey/article/view/4876
Wilkening, D. et al. (May 2024) ‘ACCSAMS: Automatic Conversion of Exam Documents to Accessible Learning Material for Blind and Visually Impaired’. Accepted at ICCHP 2024. https://doi.org/10.48550/arXiv.2405.19124.
Wiharja, S., Pradeka, D. and Suteddy, W. (May 2024) ‘Comparative Study of the Effect of Datasets and Machine Learning Algorithms for PDF Malware Detection’, Digital Zone: Jurnal Teknologi Informasi dan Komunikasi, 15(1). https://doi.org/10.31849/digitalzone.v15i1.19744.
Zhou, J. and Jing, R. (April 2024) ‘Malicious Documents Detection and Classification Using Hybrid Boosted-Support Vector Machine with High Dimensional Features’, in 2024 Third International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), pp. 01–04. https://doi.org/10.1109/ICDCECE60827.2024.10549567.