This presentation is part of PDF Days Europe 2025.

Tagged PDF in the Wild

Evaluating quality of real world Tagged PDFs

About the presenter(s)

Boris Doubrov is CEO of Dual Lab, a company specializing in product development services in computer graphics, CAD/CAM modelling, and other science-intensive areas. Boris Doubrov holds a …


Boris Doubrov
Dual Lab sprl

Description

We analyse a large volume of real-world PDFs drawn from two public sources:

  • Digital Corpora (8M PDFs)
  • randomly crawled PDFs on the Web with recent creation dates

Both collections are checked for the presence of tags and for the syntactic validity of the structure tree. In addition, every tagged PDF that contains table structure elements is compared against AI-based table recognition. Particular attention is given to the processing architecture, since analysing 8M PDFs in reasonable time and at reasonable cost is a challenge in its own right.
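
To make "syntax validity of the structure tree" concrete for tables, here is a minimal Python sketch of one kind of check. It is not the presenter's tooling: `StructElem`, `ALLOWED_CHILDREN`, `TABLE_ONLY`, and `find_table_errors` are hypothetical names, the tree is a plain in-memory stand-in rather than something parsed from a PDF, and the nesting rules are a simplified subset of the standard table structure types (role mapping and many other rules are ignored).

```python
from dataclasses import dataclass, field

# Assumption: a reduced parent -> allowed-children table for the standard
# table structure types. A real validator covers far more than this.
ALLOWED_CHILDREN = {
    "Table": {"TR", "THead", "TBody", "TFoot", "Caption"},
    "THead": {"TR"},
    "TBody": {"TR"},
    "TFoot": {"TR"},
    "TR": {"TH", "TD"},
}

# Element types that only make sense under one of the parents above.
TABLE_ONLY = {"TR", "TH", "TD", "THead", "TBody", "TFoot"}


@dataclass
class StructElem:
    """Hypothetical in-memory stand-in for a structure tree element."""
    type: str
    kids: list["StructElem"] = field(default_factory=list)


def find_table_errors(elem, path=""):
    """Return a list of messages for table-nesting violations under elem."""
    here = f"{path}/{elem.type}"
    errors = []
    allowed = ALLOWED_CHILDREN.get(elem.type)
    for kid in elem.kids:
        if allowed is not None and kid.type not in allowed:
            # Parent is a table element, but this child type is not permitted.
            errors.append(f"{here}: {kid.type} not allowed here")
        elif allowed is None and kid.type in TABLE_ONLY:
            # Table-only element placed under a non-table parent.
            errors.append(f"{here}: {kid.type} has no valid table parent")
        errors.extend(find_table_errors(kid, here))
    return errors
```

Calling `find_table_errors` on the root of a document tree returns an empty list when every table element is nested as the rules above expect, and one message per violation otherwise. At corpus scale, a per-file check of this shape would be one step inside whatever batch pipeline is used.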


