How you see PDFs versus how a search engine sees PDFs
Instead of retrieving and searching each file in its associated application, a search engine needs to review all files together in binary format, opening the door to concurrent searching and advanced search options working across the entire repository.
PDF was originally developed as a printer-like file format. When you review a PDF file, you typically look at the document from inside a PDF viewer like Adobe Reader, and the PDF file ordinarily appears just as that file would print. PDF viewers such as Adobe Reader also give you the option to search for a word or phrase.
But what if you had to search across millions of PDFs? Pulling up each PDF individually in a PDF viewer and then individually searching each for specific keywords would hardly be an efficient approach. Further, what if the collection also includes a mix of other formats like “Office” documents, email files, compressed archives and other miscellaneous data?
For efficiency, instead of individually retrieving and searching each file in its associated application, a search engine needs to review all files together in binary format. Reviewing all files at once in binary format also opens the door to multiuser concurrent searching and advanced integrated search options across the entire repository.
As with most modern data types, a PDF viewed in binary format makes it difficult to discern the words at all through the fog of binary codes. The job of a search engine is to parse the binary format to retrieve the text. The software components that do this parsing are called document filters.
Every file type, including PDF, has its own unique specification for how it stores text in binary format. Before parsing a file, the document filters need to first figure out which binary format specification applies. If the document filters attempt to parse a PDF file using a Microsoft Word specification for example, the result will be gibberish. And each individual file would require its own specification even if it appears inside a compression archive like .ZIP or .RAR.
You might think that using the filename extension would be an easy way to figure out which specification to apply. A .PDF filename extension would indicate a PDF file, and a .DOCX filename extension would indicate a Microsoft Word document. However, what if someone gives a PDF file a .DOCX filename extension and conversely labels a Microsoft Word document with a .PDF filename extension?
The way a search engine like dtSearch handles this issue is to look at the binary file header to determine the file type. That way, regardless of the filename extension, the document filters can apply the correct specification. Interestingly, the recent development of the PDF 2.0 format resulted in a binary file header update. The dtSearch document filters had to adjust their header recognition to distinguish PDF 1.x from PDF 2.0, and to distinguish all versions of PDF from other data types.
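As a rough illustration of header-based detection (a minimal sketch, not how dtSearch or any particular product implements it), the Python below checks the leading bytes of a file against a few well-known signatures and applies the same check to each member of a ZIP archive. The function names `sniff_type` and `sniff_archive_members` are hypothetical, and the signature list is deliberately tiny.

```python
import io
import zipfile

# Leading "magic" bytes for a few common formats (illustrative, not exhaustive).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP, also the container for .docx


def sniff_type(data: bytes) -> str:
    """Guess a file type from its first bytes, ignoring any filename."""
    if data.startswith(b"%PDF-"):
        # The header carries the version, e.g. b"%PDF-1.7" or b"%PDF-2.0".
        version = data[5:8].decode("ascii", errors="replace")
        return f"pdf-{version}"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"


def sniff_archive_members(zip_bytes: bytes) -> dict[str, str]:
    """Sniff every file inside a ZIP archive individually."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # The member's name is only used as a key; its content decides the type.
            results[info.filename] = sniff_type(archive.read(info))
    return results
```

With this sketch, a PDF saved as `report.docx` is still reported as a PDF because only the content is examined. A genuine .docx file shows up as `zip`, since telling it apart from other ZIP-based formats requires looking inside the container, which this example does not attempt.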
But figuring out which file format specification to apply is just the first step. Parsing a file format like PDF requires not only distinguishing the text itself, but also making use of the language encoding information that accompanies the text. In the same PDF, you can have English text followed by Chinese followed by Arabic, and each can have its own separate encoding.
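To give a feel for encoding-aware parsing, here is a minimal sketch covering only the simplest case: PDF text strings, such as metadata values, which may be stored as UTF-16BE with a byte order mark, as UTF-8 with a byte order mark (PDF 2.0), or in PDFDocEncoding. Text on the page itself depends on font encodings and character maps and is considerably more involved. The function name is hypothetical, and Latin-1 is used here as a stand-in for PDFDocEncoding, which is an approximation.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string by inspecting its byte order mark, if any."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # b"\xfe\xff" prefix: the rest is UTF-16 big-endian.
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF8):
        # b"\xef\xbb\xbf" prefix: the rest is UTF-8 (allowed from PDF 2.0).
        return raw[3:].decode("utf-8")
    # Assumption: Latin-1 approximates PDFDocEncoding. The two agree on
    # printable ASCII but differ for some other byte values.
    return raw.decode("latin-1")
```

Decoding the same bytes under the wrong assumption would yield gibberish, which is why the encoding marker has to be read before the text is.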
As noted, discerning the relevant text in a binary format PDF view is typically a lot harder for the human viewer than seeing that same document in a PDF application. On the other hand, a binary format PDF can enable a search engine and its document filters to “see” text that might easily escape scrutiny in an associated application display. Understanding this is important particularly in an investigation context. Following are some examples where a search engine could “discover” text that a PDF viewer might obscure.
- Since PDF was originally developed as a printer-like file format, a default PDF viewer display may “miss” text appearing outside of the printed page boundary. That extra text, however, would be readily apparent in binary format.
- PDFs can also include metadata content, which an ordinary associated application view might obscure. In binary format however, that additional metadata would be fully available either in a full-text search of the data collection or in a search of just relevant metadata fields.
- PDFs can have embedded objects such as Microsoft Office files, which a default PDF view may show only as a paperclip. However, document filters looking for embedded objects can readily parse the full embedded document content.
- A PDF viewer display can also hide “white on white” or “black on black” text. In contrast, such text will be “clear as day” in binary format.
- In a case similar to “black on black” text, end users attempting to redact text sometimes place a black rectangle over it. The rectangle visually blacks out the text inside a PDF viewer without actually removing the underlying text. In a recent high-profile criminal case, the attorneys for one side redacted certain text with black rectangles prior to public release. But while visually blacked out, the underlying text itself was still fully accessible, both to search engines and even to ordinary copy and paste.
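Several of the cases above come down to simple geometry and color checks once a parser has extracted each run of text along with its position. The following is a minimal sketch under that assumption. The `Box` and `TextSpan` types and the `hidden_reasons` function are hypothetical and do not correspond to any specific library; a real parser would also need to account for drawing order, clipping and transparency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)

    def contains(self, other: "Box") -> bool:
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class TextSpan:
    """A run of extracted text with its bounding box and RGB fill color."""
    text: str
    box: Box
    color: tuple[float, float, float]


def hidden_reasons(span: TextSpan, page_box: Box,
                   background: tuple[float, float, float],
                   cover_rects: list[Box]) -> list[str]:
    """Return the reasons a viewer might not show this text (empty if none)."""
    reasons = []
    if not page_box.overlaps(span.box):
        reasons.append("outside page boundary")
    if span.color == background:
        reasons.append("same color as background")
    if any(rect.contains(span.box) for rect in cover_rects):
        reasons.append("covered by filled rectangle")
    return reasons
```

In every case the text itself is still present in the file, so a document filter would index it regardless. A check like this only helps explain why a reviewer never saw it on screen.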
On the other hand, there is one important example where the PDF viewer display can make available text that might be completely hidden in binary format. That case is where a PDF file consists purely of an image. If you’ve ever tried to copy and paste a selection of text in a PDF, only to find that there is no text there to copy or paste, chances are you’ve run across an image-only PDF.
There are two points to consider. First, a PDF file that is not image-only can have a security setting that disallows copying text. For that reason, a failed copy and paste frequently indicates an image-only PDF but does not prove it. Second, an image-only PDF is not completely unsearchable, since any text metadata attached to the PDF remains searchable. But the vast majority of the text will nonetheless remain out of searchable reach.
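Because the copy and paste test is not conclusive, a more dependable signal comes from what a parser actually finds on each page. The sketch below assumes a parser has already reported, per page, how many text characters it extracted and how many images it found. `PageSummary`, `pages_needing_ocr` and `is_image_only` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    """Per-page counts, assumed to come from a PDF parser."""
    text_chars: int   # characters of extractable page text
    image_count: int  # images placed on the page


def pages_needing_ocr(pages: list[PageSummary], min_chars: int = 1) -> list[int]:
    """Indexes of pages that carry an image but no extractable text."""
    return [
        index for index, page in enumerate(pages)
        if page.image_count > 0 and page.text_chars < min_chars
    ]


def is_image_only(pages: list[PageSummary]) -> bool:
    """True when every page of a non-empty document needs OCR."""
    return bool(pages) and len(pages_needing_ocr(pages)) == len(pages)
```

Counting page text separately from metadata matters here: a scanned document with a searchable title field would still be reported as image-only.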
If you do have an image-only PDF, the way to make that full-text searchable is to run it through an OCR (optical character recognition) process, such as the one included with Adobe Acrobat. Following OCR, the resulting PDF display options truly highlight the innate benefits of the PDF file format. Specifically, the PDF can retain its full original image while adding the results of OCR “underneath” the image for use by document filters. That way, if there is a scribbled note or embedded graphic or something of that nature which the OCR process might not fully resolve, the PDF viewer can still display that additional information concurrently with document filters leveraging the rest of the text.
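One common way to keep the image while adding searchable text is to paint the page image and then place the OCR words in text rendering mode 3, which the PDF specification defines as invisible text. The sketch below builds such a page content stream as a string. It is an illustration only: the function names, the resource names `Im0` and `F1`, and the word tuple layout are assumptions, words are assumed to be ASCII, and a complete PDF would still need the image, font and page objects that this stream refers to.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def ocr_page_stream(width: float, height: float,
                    words: list[tuple[float, float, float, str]],
                    image_name: str = "Im0", font_name: str = "F1") -> str:
    """Build a content stream: full-page image plus invisible OCR words.

    Each word is (x, y, font_size, text) in page coordinates.
    """
    lines = [
        "q",                                 # save graphics state
        f"{width:g} 0 0 {height:g} 0 0 cm",  # scale the unit square to the page
        f"/{image_name} Do",                 # paint the scanned image
        "Q",                                 # restore graphics state
        "BT",                                # begin text object
        "3 Tr",                              # rendering mode 3: invisible text
    ]
    for x, y, size, word in words:
        lines.append(f"/{font_name} {size:g} Tf")       # select font and size
        lines.append(f"1 0 0 1 {x:g} {y:g} Tm")         # position the word
        lines.append(f"({escape_pdf_string(word)}) Tj")  # show (invisible) text
    lines.append("ET")                       # end text object
    return "\n".join(lines)
```

A viewer renders only the image, while a document filter reading the stream finds the words at positions matching where they appear in the scan.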
If you have a large collection of data, PDFs that require OCR can be hard to spot. A search engine can flag these so that you know if any PDFs require OCR prior to full-text search. Finally, any time you OCR a file, no matter how good the OCR process, errors can creep in, particularly if the text in the original image is less than perfectly clear. In such cases, fuzzy searching helps a query still match words that contain minor OCR errors.
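To illustrate the idea of fuzzy searching (not the algorithm any particular search engine uses), the sketch below relies on Python's standard `difflib` module to find words in OCR output that are close to a search term. The function name `fuzzy_find` and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib
import re


def fuzzy_find(term: str, text: str, cutoff: float = 0.8) -> list[str]:
    """Return distinct words in text that closely resemble the search term."""
    words = re.findall(r"\w+", text.lower())
    unique_words = list(dict.fromkeys(words))  # de-duplicate, keep order
    return difflib.get_close_matches(
        term.lower(),
        unique_words,
        n=len(unique_words) or 1,  # n must be positive even for empty text
        cutoff=cutoff,
    )
```

For example, OCR often misreads "m" as "rn", so a search for "agreement" can still find "agreernent" in the recognized text. Lowering the cutoff tolerates more errors at the cost of more false matches. Comparing a term against every word this way would be slow on a large repository, where an indexed approach would be needed.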
Contributed to pdfa.org by dtSearch