This presentation is part of PDF Days Europe 2025.
Tagged PDF in the Wild
Evaluating the quality of real-world Tagged PDFs
About the presenter(s)
Boris Doubrov is CEO of Dual Lab, a company specializing in product development services in computer graphics, CAD/CAM modelling, and other science-intensive areas. Boris Doubrov holds a …
Description
We analyse a large volume of real-world PDFs from two public sources:
- Digital Corpora (8M PDFs)
- randomly crawled PDFs on the Web with recent creation dates
These collections are checked for the presence of tags and the syntactic validity of the structure tree. In addition, all tagged PDFs containing table structure elements are compared against AI-based table recognition. Particular attention is given to the processing architecture, as analysing 8M PDFs in reasonable time and at reasonable cost is a challenge in its own right.
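As a rough illustration of what a first-pass check for the presence of tags could look like, the sketch below scans raw PDF bytes for the two catalog entries that Tagged PDF relies on, `/StructTreeRoot` and `/MarkInfo` with `/Marked true`. This is not the presenters' method: the function names `classify` and `tally` are hypothetical, and the byte-level matching is an assumption made to keep the example self-contained. It would miss catalogs stored in compressed object streams or a `/MarkInfo` given as an indirect reference, so a real checker would parse the document catalog with a PDF library.

```python
import re
from collections import Counter

# Crude byte-level patterns (assumption: the catalog is stored uncompressed
# and /MarkInfo is an inline dictionary). A real checker would parse the file.
_STRUCT_ROOT = re.compile(rb"/StructTreeRoot\b")
_MARKED_TRUE = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true\b")


def classify(data: bytes) -> str:
    """Return 'tagged', 'partial', or 'untagged' for raw PDF bytes."""
    has_root = _STRUCT_ROOT.search(data) is not None
    is_marked = _MARKED_TRUE.search(data) is not None
    if has_root and is_marked:
        return "tagged"
    if has_root or is_marked:
        # Only one of the two markers was found; worth a closer look.
        return "partial"
    return "untagged"


def tally(documents) -> Counter:
    """Count classification results over an iterable of PDF byte strings."""
    return Counter(classify(d) for d in documents)


if __name__ == "__main__":
    samples = [
        b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>",
        b"<< /Type /Catalog /StructTreeRoot 5 0 R >>",
        b"<< /Type /Catalog /Pages 1 0 R >>",
    ]
    print(tally(samples))
```

Because each file is classified independently, a check of this shape can be spread across many workers, which is one reason the processing architecture matters at the 8M-file scale.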