OCR for PDFs – old news?

Thomas Zellmann, PDF Association

Thomas Zellmann discusses the benefits of OCR and of making PDFs fully searchable, especially as input for AI systems.

Article, December 2, 2019

German version available

These days, scanning documents to Portable Document Format (PDF) or PDF/A, with text recognition, should be part of every company’s day-to-day operations. This, after all, is what makes it possible to convert scanned documents into text that can be selected, searched and edited. However, many managers are still unsure whether to upgrade an existing OCR solution or pick up a new one entirely.

The most important criterion is the quality of the text recognition, which needs to be as high as possible. If the OCR detects “Hel1o Wor d” instead of “Hello World”, for example, everything from simple searches to modern AI-driven applications will fail on that text. One simple test that readers can try is to ask their OCR system to find instances of a word like “Christmas” when it has been split across two lines to read “Christ-mas”.
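The “Christ-mas” test can be sketched in a few lines of Python. This is only an illustration that works on text already extracted from a scanned PDF; the function names are hypothetical and not part of any OCR product, and the rule used here (rejoin every word split by a hyphen at a line end) is a deliberate simplification.

```python
import re


def join_hyphenated_line_breaks(text: str) -> str:
    """Rejoin words split by a hyphen at a line end ('Christ-\\nmas' -> 'Christmas').

    Simplification: a genuine compound such as 'well-\\nknown' also loses
    its hyphen. Real engines usually consult a dictionary before joining.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def contains_word(ocr_text: str, word: str) -> bool:
    """Case-insensitive whole-word search after rejoining split words."""
    normalized = join_hyphenated_line_breaks(ocr_text)
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, normalized, flags=re.IGNORECASE) is not None


# Hypothetical OCR output with the word split across two lines.
sample = "We wish you a merry Christ-\nmas and a happy new year."
print(contains_word(sample, "Christmas"))
```

A naive search on the raw text would miss the word, because the string “Christmas” never appears in it as one unbroken token.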

Meanwhile, when scanning and capturing old yellowing papers or faded faxes, businesses need to accept the probability of smears and similar problems. However, an ordinary business document can still be scanned with a high level of accuracy, provided professional scanning hardware is used. The state of the art in today’s color scans is a resolution of 300 dpi. For a simple visual check, it is enough to copy the recognized text and paste it into Word. An application like the free Foxit Reader also has a helpful feature for scans that directly displays the text detected with OCR.

High performance

Another key criterion is the performance of the OCR engine. This matters most in high-volume mailrooms, which need to scan a very large number of pages in a short space of time. Professional OCR solutions – like those provided by the members of the PDF Association – use complex software algorithms to achieve the maximum possible recognition rates, but require more processing time. On the other hand, there are solutions that run faster but have worse recognition rates.

Good OCR engines are also distinguished by their use of dictionaries as comparison tools for the individual letters they recognize. Language support is important when handling documents in languages other than English. Beyond common Latin-alphabet languages like French and Spanish, support for other scripts can be just as important. Languages like Arabic, Hebrew, Chinese, Japanese or Korean can incur additional costs, depending on the provider.
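To illustrate the idea of a dictionary as a comparison tool, here is a minimal sketch using Python’s standard `difflib` module. It is not how any particular OCR engine works internally; the tiny word list, the `correct_token` helper and the 0.7 similarity cutoff are all assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical, deliberately tiny dictionary; a real one holds a full
# vocabulary for each supported language.
DICTIONARY = ["hello", "world", "invoice", "christmas"]


def correct_token(token: str, dictionary=DICTIONARY, cutoff: float = 0.7) -> str:
    """Return the closest dictionary word, or the token unchanged if none is close.

    The result is lowercase because the dictionary is lowercase; restoring
    the original casing is left out to keep the sketch short.
    """
    matches = get_close_matches(token.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


for raw in ["Hel1o", "Wor1d", "xyzzy"]:
    print(raw, "->", correct_token(raw))
```

A per-language dictionary is also one reason why language support matters: a word list for English does little for a document written in French or Arabic.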

So-called zonal OCR represents another important function for some areas of application, particularly for large-format blueprints. These kinds of documents often have text blocks in their headers or footers, which need to be captured using OCR, but the plans themselves contain no text.
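The zonal approach can be sketched as follows, assuming the OCR engine reports each recognized word together with a bounding box. The `Box` type, the `footer_zone` helper, the 10 % footer height and the sample coordinates are hypothetical; the coordinate system is assumed to have its origin at the top left, with y growing downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        """True if `other` lies completely inside this box."""
        return (
            self.left <= other.left
            and self.top <= other.top
            and self.right >= other.right
            and self.bottom >= other.bottom
        )


def footer_zone(page_width: float, page_height: float, fraction: float = 0.1) -> Box:
    """Zone covering the bottom `fraction` of the page (assumed footer area)."""
    return Box(0, page_height * (1 - fraction), page_width, page_height)


def words_in_zone(words, zone: Box):
    """Keep only the words whose bounding box falls entirely inside the zone."""
    return [text for text, box in words if zone.contains(box)]


# Hypothetical word boxes on a 1000 x 1000 page.
words = [
    ("Floor", Box(100, 950, 160, 970)),
    ("plan", Box(170, 950, 210, 970)),
    ("stray", Box(400, 300, 450, 320)),
]
print(words_in_zone(words, footer_zone(1000, 1000)))
```

In practice an engine with zonal support would only run recognition inside the zone, which saves processing time on large drawings; this sketch merely filters results after the fact.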

Excellent format

Last but not least, the additional output formats supported by a given OCR solution also matter. By default, OCR results are embedded into the PDF file. However, in many cases a separate file containing the OCR results is also needed, for example to index PDF documents. Alternatively, modern AI applications tend to require the content of a scanned document as input. The spectrum of formats ranges from plain .txt files to MS Office, ALTO-XML or other XML formats.
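As a small illustration of why separate OCR output is useful for indexing, the sketch below builds a simple inverted index from per-page plain text. It assumes the OCR results are already available as one string per page; the `build_index` function and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted list of pages it occurs on.

    `pages` is a list of plain-text strings, one per page; page numbers
    start at 1.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical OCR text for a two-page document.
pages = ["Invoice number 42", "Total invoice amount"]
index = build_index(pages)
print(index["invoice"])
```

Richer formats such as ALTO-XML additionally carry position information for each word, which a plain-text index like this one discards.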

If new AI applications require even higher-quality content, or your current OCR system is inadequate, it can be worth re-evaluating the OCR solution you use.

Conclusion

PDF is an outstanding document format for embedding OCR results and making scanned documents fully text-searchable. Beyond this, PDF also offers further advantages for scanned documents, namely compression and consistent reproduction across any operating system or platform.
