What can we learn about the Mueller Report from the PDF file released by the Department of Justice (DoJ) on April 18, 2019?
This article offers two things: a technical assessment of the PDF file itself, and a cultural assessment of why the report was released as a PDF.
Redaction is the process of removing content from a document. There are various ways to achieve redaction in electronic documents, ranging from removal of content from an original source document to printing and re-scanning after redaction. Unfortunately, DoJ chose the latter approach, resulting in a pixelated, low-quality document that will serve poorly in every subsequent use.
We downloaded "report.pdf" (PDF, 139 MB) from the Special Counsel's page on the US Department of Justice website.
Technically, the file is a PDF 1.6 document of acceptable quality. It does not conform to ISO 19005 (PDF/A), the archival standard for PDF files. It is neither digitally signed nor encrypted.
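As a rough illustration of how properties like these can be checked, the sketch below reads the version from a PDF header and looks for two byte markers: `/Encrypt` (named in the trailer of an encrypted file) and `/ByteRange` (an entry in signature dictionaries). The function name `inspect_pdf_bytes` and the sample bytes are hypothetical. A byte search is only a crude heuristic, the document catalog can override the header version, and nothing here tests PDF/A conformance, which needs a dedicated validator.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return the header version and crude signed/encrypted flags for raw PDF bytes."""
    # The "%PDF-x.y" header is expected near the start of the file.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    version = match.group(1).decode("ascii") if match else None
    return {
        "version": version,
        # Heuristic: an encrypted file references an /Encrypt dictionary.
        "encrypted": b"/Encrypt" in data,
        # Heuristic: a signature dictionary carries a /ByteRange entry.
        "signed": b"/ByteRange" in data,
    }


# Hypothetical minimal bytes standing in for a real file.
sample = (
    b"%PDF-1.6\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)
print(inspect_pdf_bytes(sample))
```

To try it on a downloaded copy, pass the result of `open("report.pdf", "rb").read()`, bearing in mind that the whole 139 MB file is read into memory.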
Based on its metadata, the PDF released by the Department of Justice was produced by a Ricoh MP C6502, probably a typical office network copier/printer. The file was produced on April 17 after 6:23 pm.
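PDF metadata stores timestamps such as the production date as date strings of the form `D:YYYYMMDDHHmmSS` with an optional time-zone suffix. The following is a minimal sketch of how such a string can be converted for inspection. The helper `parse_pdf_date` is hypothetical, and the sample value is illustrative only: it was chosen to match the time mentioned above and is not copied from the file's metadata.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY, then optional MM DD HH mm SS, then an optional zone (Z, +HH'mm', -HH'mm').
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string to a datetime (time-zone aware if a zone is given)."""
    m = PDF_DATE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    fields = m.groups()[:6]
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, times to 0
    year, month, day, hour, minute, second = (
        int(f) if f is not None else d for f, d in zip(fields, defaults)
    )
    sign, off_h, off_m = m.group(7), m.group(8), m.group(9)
    tz = None
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


# Illustrative value only, not the actual metadata of the released file.
print(parse_pdf_date("D:20190417182300"))
```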
The document consists of 448 RGB (color) images, each at 200 dpi and 2200 x 1700 pixels in size. The images were compressed with lossy compression more appropriate to photographs than to text. This is the cause of the "noise" associated with the text.
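The figures above can be sanity-checked with a little arithmetic. The sketch below assumes 8 bits per color channel (3 bytes per RGB pixel) and reads "139 MB" as 139 million bytes. Both are assumptions, so the resulting ratio is an estimate rather than a measurement.

```python
PAGES = 448
DPI = 200
LONG_PX, SHORT_PX = 2200, 1700
BYTES_PER_PIXEL = 3               # assumption: 8-bit RGB
FILE_SIZE_BYTES = 139 * 10**6     # assumption: "139 MB" as decimal megabytes

# Physical page size implied by the pixel dimensions and resolution.
page_inches = (SHORT_PX / DPI, LONG_PX / DPI)

# Size of the page images before any compression.
raw_page_bytes = LONG_PX * SHORT_PX * BYTES_PER_PIXEL
raw_total_bytes = raw_page_bytes * PAGES

# How hard the lossy compression had to squeeze to reach the released size.
compression_ratio = raw_total_bytes / FILE_SIZE_BYTES
avg_page_bytes = FILE_SIZE_BYTES / PAGES

print(f"page size: {page_inches[0]} x {page_inches[1]} inches")
print(f"uncompressed: {raw_total_bytes / 10**9:.2f} GB")
print(f"compression ratio: about {compression_ratio:.0f}:1")
print(f"average compressed page: about {avg_page_bytes / 1000:.0f} KB")
```

The pixel dimensions correspond to 8.5 x 11 inch pages, and the raw image data comes to roughly 5 GB. Under these assumptions the images were compressed by a factor in the mid-thirties, which is consistent with visible artifacts around text.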
Analysis: The fact that DoJ chose to deliver an "images only" PDF forces a much larger file-size and loss of searchable text. Effectively, this process "dumbed down" the PDF to a set of images - the same type of content that comes out of a scanner. Admittedly, it is also a crude but effective means of ensuring (beyond redaction) that nothing is released besides images of pages... but the redaction software available to DoJ (see below) is fully effective at redacting born-digital PDF files, so image conversion was unnecessary.
From the scanner artifacts left on the images (e.g. the horizontal yellow streak and the gray vertical streak on the right edge) and the voluminous compression artifacts, we assess that the document has certainly been scanned and compressed at least once and more probably twice.
Although DoJ did not OCR the report prior to its release, those downloading the file are free to use their own OCR. Results will not be ideal or identical since the source images are of relatively low quality. In particular, OCR errors will be more common adjacent to underlines and redactions.
Analysis: We assess that the document was most likely scanned twice, with redactions being added to the first scanned document using software. This implies that the document may have been provided to DoJ on paper rather than as an electronic document. If it was provided by Mueller to DoJ electronically, then printing it just to scan it back into another, far larger and less capable PDF is difficult to understand.
In addition to not being searchable, the file contains no text, is not tagged, and is therefore not accessible to disabled users.
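A rough way to check for these properties is to look for the markers that text and tagging leave in a file: font resources (`/Font`) and a structure tree root (`/StructTreeRoot`). The sketch below is a heuristic under stated assumptions, not a conformance test. The function names are hypothetical, only Flate-compressed streams are inflated, and the presence of a marker does not prove the tagging is adequate for Section 508.

```python
import re
import zlib

# Capture the raw bytes between "stream" and "endstream".
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def searchable_bytes(data: bytes) -> bytes:
    """Raw file bytes plus the contents of any streams that inflate as Flate data."""
    parts = [data]
    for match in STREAM.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            parts.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (for example a JPEG image); skip it
    return b"\n".join(parts)


def accessibility_markers(data: bytes) -> dict:
    """Report whether font and structure-tree markers appear anywhere in the file."""
    haystack = searchable_bytes(data)
    return {
        "has_fonts": b"/Font" in haystack,
        "has_structure_tree": b"/StructTreeRoot" in haystack,
    }


# Hypothetical image-only fragment: an image object, no fonts, no structure tree.
image_only = b"%PDF-1.6\n1 0 obj\n<< /Type /XObject /Subtype /Image >>\nendobj\n"
print(accessibility_markers(image_only))
```

A file for which both values are false contains neither text nor tags as far as this check can tell, which matches the description above.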
The US Department of Justice has a clear policy of ensuring that public documents comply with Section 508 regulations, and are therefore accessible to users with disabilities. The Mueller Report PDF does not conform with these regulations.
If the Mueller Report had been delivered to DoJ as a high-quality born-digital PDF, it would have been tagged from the outset. DoJ could have easily redacted it without resorting to printing the result and re-scanning the printed paper.
Analysis: If Mueller had delivered a paper document instead of a PDF, then DoJ's process, while not best practice or even within the regulations, is more understandable due to time pressures. If Mueller had delivered a high-quality PDF, however, then it's exceptionally unfortunate that DoJ chose to "dumb it down" when processing and releasing it.
The PDF includes bookmarks for 15 subsections of the document that are not reflected in the document's logical structure; these subsections don't match the document's table of contents.
Analysis: This implies that DoJ broke the Mueller Report into subsections matching the bookmarks so that a team could collaborate on adding redaction annotations, then reassembled the document and output a finished, redacted PDF file.
Due to their consistency and regularity of form and application, it's clear that the redactions were performed by software rather than by manual methods (i.e., marking up a printed document). The redaction implementation (style, spacing, label) is completely consistent throughout the document, indicating expert use of professional-class redaction software.
Using high-quality redaction software allows organizations to collaborate effectively on such projects, ensuring that the type of redaction used, as well as the color-codes and other features, are consistent for all collaborators. It is to be expected that DoJ possesses and is expert in the use of such software.
Instead of delivering "native" redactions, however, it's obvious that DoJ printed and then scanned the document after it was redacted. We know this because on many pages a scanner artifact (the faint yellow line) crosses a redacted area. This deliberate and unnecessary act made the document substantially harder for anyone and everyone to use, forever.
Analysis: I asked Mark Gavin, CTO of Appligent Document Solutions, and the developer of the first PDF redaction tool, for his comments on the redaction method used in the Mueller Report. Mark said:
"Native PDF redaction has been available now for more than 20 years, yet this document is just images of redacted pages. As such, there is no searchable text, the document will not reflow on different devices and most importantly this document is not Section 508 compliant. The document cannot be read by a screen reader for people with visual disabilities and it cannot be analyzed using any text analysis tools. The Mueller Report as a redacted PDF document is really kind of sad."
It's interesting - and deeply unfortunate - that DoJ clearly used advanced redaction software but nonetheless chose to deliver a paper-age "images only" PDF. In so doing they:
Everyone knew that the US Department of Justice and Attorney General Barr would release the Mueller Report as a PDF file.
In fact, it's safe to say that AG Barr never considered delivering anything else. No one would have even suggested a Word file, or a set of TIFF images, or a website, or an XPS file, or EPUB, or plain text. It's 2019, but it seems safe to say that they simply assumed they'd use PDF.
If you are like most people, you simply assumed it would be a PDF as well.
Once he was done writing and editing, Mueller needed to unambiguously "freeze" or fix his document for the purposes of submitting a report. PDF is the only mainstream document format offering this capability.
Why is the fixed nature ("rendering") so important? A rendering carries the clues humans use to judge authenticity, such as layout, formatting, dates, logos and signatures, along with many other, more subtle cues.
Everyone knows this, which is why people exchange contracts rather than simply share access to a wiki page. The need for a rendering made it easy to predict ahead of time that Barr would release the Mueller report as a PDF, and would never have considered converting its text to DOCX, or posting the text as HTML on a website.
In releasing the redacted PDF of the report to the public, Barr avoids suspicion that the document has been edited (changed) beyond the straightforward redactions. PDF serves the need to unambiguously assure the press and the public that they are seeing Mueller's actual report.
PDF is the only electronic document format that fully supports redaction. The alternative is a printer plus a pair of scissors, a grease pen or opaque tape. There's really no model for redaction of HTML-based web content. Redaction of DOCX files, while possible, doesn't provide any assurance about the content prior to its redaction.
A PDF file, whether "born digital" or scanned from paper, can be made text-searchable and accessible. Barr chose to release scanned pages instead of searchable, accessible pages, but this may have been due to time pressures.
These days, PDF is supported - at least to some degree - by most browsers and on most platforms, in addition to the thousands of PDF technology-specific applications on the market. Using PDF ensures that all of those diverse users will get a consistent, reliable PDF rendering (although, in the case of this report, sadly, no tags for accessibility, as required by law).
Unfortunately, the image-based PDF the Department of Justice delivered is the least usable of all the options they could have chosen.
PDF is the only document format capable of carrying the cultural and technical requirements for important communications in the modern age.