Is the Information You Just Redacted Really Gone?
Redaction is the act of removing content directly from the content stream of the page, traditionally done by placing a black bar over the text. But, is the sensitive information you’re removing actually gone?So, your organization is redacting sensitive information, like social security numbers, out of documents prior to making them available to the public. One of the common practices used to be adding a black box over the targeted content using Word. Then, the document was converted to a PDF file using Word’s built-in converter. The resulting PDF looks perfectly redacted - after all, the content is blacked out. Everything is going great until a couple of years down the line, someone in your organization realizes that anyone can just move the black box “redaction” to uncover the social security number underneath.
This happens more often than you would think. I was on a call with a company that had done just that. They had thousands of incorrectly redacted documents and were looking for an automated solution to perform real redactions on those documents.
I came to realize that a lot of the problems around bad redactions could possibly stem from the fact that it’s not clear what real redactions really are. So, what is redaction? In PDF, redaction is the act of removing content directly from the content stream of the page. An optional piece of content is usually added in place of the removed content to indicate something has changed. This is traditionally a black box, however, it does not have to be a box, and the color does not have to be black. The important and mandatory part of redaction is that the content is permanently removed from the document.
Redaction is typically a 2 step process, with an optional 3rd step
- The content to be redacted is identified and redaction annotations are placed over it
- The redaction annotations are reviewed and applied, permanently removing the content
- An additional step is to sanitize the document, cleaning up sneaky data like metadata, bookmarks, links, and anything that could have content in it that you do not want available
As you can see, if those steps are not followed properly, many things can go wrong, and you might end up distributing documents that still contain sensitive information. The most common example of incorrectly redacted documents is the one that I started the article with. We focus on the optional black box that goes over the content, and don’t realize the content is still readily available in the document. What if we decide to manually select and delete the content, and then manually add a black box over it? Aside from this being a laborious process, there are some major downsides to it. A lot of tools keep versions of a document without us ever realizing that. Those versions will contain previously deleted content. Metadata can also contain previously deleted content, or references to it. Properly redacting a document will take care of all of those issues.
Another common mistake while attempting to redact a document is to change the font color of sensitive information to simply match the background. The idea is that if you can’t see the text, it’s not there. This is perhaps the least secure of all the incorrect redaction methods available. Simply selecting all the text on a page will reveal all the “hidden” text. Furthermore, this text can be searched for, and changing the font color back to a visible one is easy.
You want to use a tool that is designed for proper redaction. But, what if we use a tool that claims to redact a document, but does a poor job? How do you know? Such tools are more common than you would think, so let me give you some pointers on redaction:
- Make sure the document is sanitized after the redaction. As I mentioned before, the document’s metadata can contain sensitive information. There can be bookmarks and links. Previous versions of documents can contain information we thought is redacted, making it readily available. Search indexes and review comments are also good hiding spots for sensitive data. A good PDF redaction tool will clean up all those, and more, during sanitization.
- Use a tool that has a clean redaction. Some tools can redact just fine, but they are what I call too 'loose'. Instead of just redacting the social security number for example, you also lose nearby content that could be above, below, left, and right of what you were targeting. See image below for an example of what I mean.
- Redaction is commonly used with text. However, redaction can apply to different types of content – diagrams for example. While you are trying to partially redact sensitive information out of a diagram, the tool you are using might not be able to do that and redact the whole image instead. Make sure the tool you are using handles image redaction properly.
The screenshot above is from a document redacted with a popular PDF tool. The tool not only redacted the desired information, but also text on one line above and below each redaction.
This is what this document should look like when it’s properly redacted. See image below.
What tools and methods can we use to redact PDF documents properly? Adobe Acrobat is the industry leader when it comes to end user PDF tools. It has a very comprehensive set of redaction tools. It’s the preferred and recommended tool of many US government institutions. Plus, there are great resources available that explain the redaction process in detail. You can see instructions on how to redact a PDF document using Acrobat DC here. If you are using Acrobat X, check out Rick’s Acrobat X Redaction Guide.
While Acrobat is great for redacting single documents, what happens when you want to redact batches of documents? That’s where Datalogics comes in. We offer the tool that drives Acrobat, and its redaction process – the Adobe PDF Library. We also offer another Adobe tool that can help you redact documents – Datalogics PDF Java Toolkit. Both can offer batch redaction functionality as well as an automated, but on the fly, document by document process.
Redaction is a very important tool in the document market. If performed incorrectly, sensitive information can leak to the public, potentially leading to lawsuits, scandals, etc. To avoid that, you need to make sure your documents are redacted correctly, you have the right processes in place, and are using the right tools.
For more information about redaction, contact us.
Datalogics, Inc. provides a complete SDK for PDF creation, manipulation and management for companies around the globe. Built on Adobe source code, our flagship product Adobe® PDF Library offers a choice of programming platforms and languages along with unsurpassed customer service, proven by our 94% customer retention rate. Datalogics offers…
Read more