Digging for information by extracting data from a PDF document

About Nadine Schuppisser

Extracting text from a PDF document is one of the most popular information retrieval functions. But what about other information, such as images, metadata and more? It can be simple, but also tricky.

Member News, December 8, 2016

DISCLAIMER
The views expressed in this article are those of the author(s) and do not reflect the policies or positions of the PDF Association.

Metadata is among the easiest things to extract. Document metadata can usually be extracted as a short XMP stream. Even if the document carries an old-fashioned information dictionary instead, extracting its key/value pairs is not a big deal. Outlines (bookmarks) and navigation aids such as named destinations, links and the like are similarly easy to extract.
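
To illustrate how little is needed for the XMP case, here is a minimal Python sketch using only the standard library. It is not the author's or any product's method. It assumes the metadata stream is stored uncompressed and unencrypted, so the XMP packet can be found by scanning the raw bytes for its `<?xpacket ... ?>` wrapper. The helper names `find_xmp_packets` and `dc_title` are made up for this example. Compressed or encrypted streams, and the information dictionary, need a real PDF parser.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
XPACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

# Namespaces used by the Dublin Core title property in XMP.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def find_xmp_packets(pdf_bytes: bytes) -> List[bytes]:
    """Return the XML body of every plain-text XMP packet in the file."""
    return [m.group(1).strip() for m in XPACKET_RE.finditer(pdf_bytes)]


def dc_title(xmp_xml: bytes) -> Optional[str]:
    """Return the first dc:title value of an XMP packet, or None."""
    root = ET.fromstring(xmp_xml)
    title = root.find(f".//{DC}title")
    if title is None:
        return None
    # dc:title is normally an rdf:Alt list of language alternatives;
    # this sketch simply takes the first rdf:li entry.
    li = title.find(f".//{RDF}li")
    text = li.text if li is not None else title.text
    return text.strip() if text else None
```

To use it, read the file in binary mode, pass the bytes to `find_xmp_packets`, and hand each returned packet to `dc_title`. An empty list means no plain-text packet was found. That does not prove the document has no metadata.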

Pdftools counts more than 5,000 companies and organizations in 70 countries among its customers, making it one of the world’s leading producers of software solutions and developer components for PDF and PDF/A products. The product range supports the entire document flow, from raw materials to scanning processes through to signing…
