ISO 32004: an overview
Matthias Valvekens, the ISO project leader who led development of ISO 32004, has written a pair of blog posts to explain what ISO 32004 does and does not achieve in leveraging MAC technology for securing encrypted PDF files against tampering.
Introduction
For a few years now, the PDF industry has been working on a new mechanism to improve integrity protection in encrypted documents. ISO/TC 171/SC 2, the committee that manages the PDF standard, took ownership of that effort in the form of the ISO 32004 project.
A few months ago, that project moved into the DTS stage — the final step prior to publication! Now's a great time to go over which problems ISO/TS 32004 sets out to solve, and perhaps more importantly, which problems it doesn't solve. Since I've been leading the project for the past year, I figured that I was in a good position to provide that write-up.
The goal of this post is not to explain what ISO/TS 32004 says exactly, nor to tell you how you should implement it at the technical level. Rather, I want to help other PDF developers understand the context of the document so they can make better use of it.
Motivation
Life in PDF-land without integrity protection
A few years ago, a team of security researchers at the Ruhr University Bochum (RUB) described a number of attacks against PDF's encryption feature set.
Broadly, the attacks announced fall into two categories.
- data exfiltration tricks allowing an attacker to smuggle data out of an authorised user's working environment;
- content manipulation exploits that allow encrypted PDF files to be changed without the knowledge of the encryption key, and without downstream users noticing.
The former category is more properly classified as exploiting the behaviour of specific viewers, and that's not what this post is about. However, the latter set of attacks really points to issues with the specification itself. Here are some of the highlights.
First up, we have cryptographic malleability issues.
- As of ISO 32000-2:2020, PDF supports RC4 (legacy) and AES-CBC for content encryption out of the box. Both of these schemes are unauthenticated, i.e. don't allow the decrypter to determine cryptographically whether a ciphertext has been tampered with. This is already problematic, but in the case of PDF, relatively difficult to exploit directly; it requires a little extra help.
- Unfortunately, that extra help is provided by a little piece of ciphertext with a plaintext required by ISO 32000-2:2020's most recent security handler. That actually allows for the systematic injection of attacker-controlled content. If you're interested, give the RUB team's explanation of CBC malleability a read.
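To make the malleability problem concrete, here's a minimal sketch using RC4, the legacy cipher mentioned above. Note that this is the simpler stream-cipher variant of the problem, not the CBC gadget technique from the RUB paper. The key, the messages and the assumption that the attacker knows the original plaintext are all made up for illustration.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: encryption and decryption are the same operation."""
    # Key-scheduling algorithm
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Keystream generation, XORed with the input
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


key = b"secret file key"            # known to legitimate users only
original = b"Pay Alice 100 EUR"
ciphertext = rc4(key, original)

# The attacker knows (or guesses) the plaintext, but never touches the key.
target = b"Pay Mallory 9 EUR"       # same length as the original
tampered = bytes(c ^ o ^ t for c, o, t in zip(ciphertext, original, target))

# A legitimate user decrypts the tampered ciphertext without any error.
recovered = rc4(key, tampered)
print(recovered)
```

Nothing in the decryption step can tell that the ciphertext was altered. That's what "unauthenticated" means in practice.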
Moreover, there are also fundamental ways in which the design of the PDF standard's encryption features leaves documents wide open to manipulation, irrespective of any cryptographic issues.
- In PDF, only streams and strings are encrypted. All other objects can be manipulated freely.
- A PDF file's cross-reference data, which acts as a "roadmap" of sorts for the document, is never encrypted and can be manipulated to do all sorts of crazy things.
- Did I mention that you can also straight up delete encrypted objects with impunity?
- Even better: ISO 32000-2:2020 allows "local" encryption overrides through the Identity crypt filter. This feature was intended to allow things like file metadata to remain unencrypted if desired, but actually allows entire streams, pages, images, etc. to be replaced with attacker-controlled ones as long as they declare themselves as unencrypted. Oops.
So, the upshot is this: given a legitimate encrypted PDF document, a sufficiently clever attacker can make the document say whatever they want without knowing the key. If an unsuspecting user then opens the document in their favourite viewer, they'll get a password prompt (as they probably expected). When the correct password is entered, the viewer will dutifully display the attacker-controlled content, and our poor user is none the wiser.
This is a problem, because in the mind of a business user — and the same probably goes for most tech people! — something being "password-protected" is pretty much synonymous with "kept under lock and key". People expect password-protected data to be totally inaccessible and untouchable without knowledge of the password. This expectation is being subverted by PDF's approach to encrypting documents.
Clearly, something had to be done...
Authenticated encryption
The mismatch between users' (by no means unreasonable) expectations and reality is due to the conflation of the following two security properties:
- Data confidentiality means that the data can't be read without the key.
- Data authentication means that the data can't be altered without the key.
Historically, the authentication aspect was not a concern in the design of PDF's encryption features. That point of view is now very dated 1, but the benefit of hindsight doesn't really help us solve the issue at hand. So, what can we do?
In modern cryptographic practice, the authentication problem is typically addressed by augmenting the ciphertext with a Message Authentication Code (MAC). A MAC is a kind of keyed 2 digest, computed over the ciphertext. Tampering with the ciphertext invalidates the MAC, and to recompute the MAC, you need access to the key. The idea is that the receiving party independently computes the MAC over the given ciphertext prior to decryption, and rejects the message if the result doesn't match the value supplied by the sender.
Warning
It's crucial here that validating a MAC requires being able to recreate it. Both operations require access to the same key. There's no public/private key distinction. All participants in the process have exactly the same capabilities when it comes to producing and verifying MACs.
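As a rough illustration of the verify-before-decrypt flow, here's a sketch using HMAC-SHA256 from Python's standard library. The key and the "ciphertext" are hypothetical stand-ins, and the function names are made up. Note how checking a tag is literally recomputing it with the same key.

```python
import hashlib
import hmac


def compute_mac(mac_key: bytes, ciphertext: bytes) -> bytes:
    """Sender side: keyed digest over the ciphertext."""
    return hmac.new(mac_key, ciphertext, hashlib.sha256).digest()


def check_mac(mac_key: bytes, ciphertext: bytes, received_tag: bytes) -> bool:
    """Receiver side: recompute the tag with the same key and compare."""
    expected = compute_mac(mac_key, ciphertext)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, received_tag)


mac_key = bytes(range(32))                   # hypothetical shared key
ciphertext = bytes.fromhex("8f3a1c77d4e2")   # stand-in for encrypted data
tag = compute_mac(mac_key, ciphertext)

# Flip a single bit in the ciphertext
tampered = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]

print(check_mac(mac_key, ciphertext, tag))   # untouched message
print(check_mac(mac_key, tampered, tag))     # tampered message
```

A receiver would only move on to decryption if `check_mac` succeeds.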
There are many different ways to construct MACs. Here are some examples:
- HMAC is a family of MACs constructed from a hash algorithm (e.g. HMAC-SHA256). The cryptographic strength of the resulting MAC depends on that of the underlying hash function. HMAC is very widely used and has also been adopted by ISO/TS 32004.
- Poly1305 is a MAC scheme that essentially works by considering the message as a polynomial over a finite field (the field of order 2^130 - 5, incidentally), and evaluating that polynomial at the key. It's commonly used in conjunction with ChaCha20, including in TLS.
- CBC-MAC is a (somewhat dated) MAC scheme that works by keeping the last block of a CBC encryption operation as a MAC. It used to be popular before HMAC gained widespread adoption, but is less common now.
- GMAC is the MAC scheme that lies at the basis of the AES-GCM authenticated encryption algorithm.
- KMAC is a more recent MAC scheme that one gets more or less "for free" from the fact that the design of SHA-3 (Keccak) is resistant against length extension attacks.
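To show what "evaluating a polynomial at the key" can look like, here's a compact sketch of Poly1305, assuming the variant specified in RFC 8439 (the one paired with ChaCha20). It's written for readability, not for production use: it isn't constant-time, and a Poly1305 key must never be reused across messages.

```python
def poly1305(key: bytes, msg: bytes) -> bytes:
    """Poly1305 as described in RFC 8439; key is 32 bytes (r || s)."""
    # r is "clamped": some bits are forced to zero
    r = int.from_bytes(key[:16], "little") & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
    s = int.from_bytes(key[16:32], "little")
    p = (1 << 130) - 5  # the prime 2^130 - 5

    acc = 0
    for i in range(0, len(msg), 16):
        # Each 16-byte block (plus a 0x01 marker byte) is one coefficient
        block = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        # Horner's rule: evaluate the polynomial at r, modulo p
        acc = (acc + block) * r % p

    # Add s and keep the low 128 bits as the tag
    return ((acc + s) % (1 << 128)).to_bytes(16, "little")


# Test vector from RFC 8439, section 2.5.2
key = bytes.fromhex(
    "85d6be7857556d337f4452fe42d506a8"
    "0103808afb0db2fd4abff6af4149f51b"
)
msg = b"Cryptographic Forum Research Group"
print(poly1305(key, msg).hex())
```

The loop is just Horner's rule applied to the message blocks, which is why the scheme is so cheap to compute.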
Most authenticated encryption schemes in common use (including AES-GCM, AES-CCM, ChaCha20-Poly1305 and many others) are constructed by combining an encryption primitive with a MAC function 3.
So, having digested all that information, it seems that all we have to do is to apply a MAC to our encrypted data. Now, let's figure out how that's supposed to work in a PDF document.
This would be a good time to go read MACs vs. signatures in PDF, the companion post to this piece that explains why the distinction between MACs and digital signatures is relevant. It's not a strict prerequisite for understanding the rest of this article, but it's useful background info.
Design philosophy
General design
The initial requirements of the PDF MAC project were more or less the following.
- Find a way to protect entire (encrypted) PDF files using a MAC.
- Leverage the shared secret behind the encryption (usually a password) to provide keying material 4 for the MAC process.
- Make sure the system is backwards compatible: files with MACs should remain readable to non-MAC-aware processors.
- At the same time, there needs to be a degree of protection against attackers removing MACs.
The MAC scheme standardised in ISO/TS 32004 integrates into PDF in much the same way as digital signatures.
- There's a ByteRange to indicate the covered portion of the document.
- The actual MAC token is stored in the "hole" left by the declared ByteRange.
- Like signatures, MACs are embedded using CMS — although we use `AuthenticatedData` instead of `SignedData` for obvious reasons (see RFC 5652).
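The ByteRange idea can be sketched as follows. The file layout and the placeholder are made up for illustration, and the actual token in ISO/TS 32004 is a CMS `AuthenticatedData` structure rather than a bare digest. This only shows why the token can be written into the "hole" after the fact.

```python
import hashlib


def covered_bytes(data: bytes, byte_range: list[int]) -> bytes:
    """Concatenate the (offset, length) segments declared by a ByteRange."""
    pairs = zip(byte_range[0::2], byte_range[1::2])
    return b"".join(data[offset:offset + length] for offset, length in pairs)


# Made-up layout: content, a zero-filled placeholder, more content.
before = b"%PDF-2.0 (objects...) "
placeholder = b"<" + b"0" * 32 + b">"
after = b" (rest of trailer) %%EOF"
document = before + placeholder + after

hole_start = len(before)
hole_end = hole_start + len(placeholder)
byte_range = [0, hole_start, hole_end, len(document) - hole_end]

digest_before = hashlib.sha256(covered_bytes(document, byte_range)).digest()

# Writing a token of the same size into the hole leaves the covered bytes,
# and therefore the digest, untouched.
token = b"<" + b"ab" * 16 + b">"
final_document = before + token + after
digest_after = hashlib.sha256(covered_bytes(final_document, byte_range)).digest()

print(digest_before == digest_after)
```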
The following figure illustrates what an ISO/TS 32004 MAC looks like in context. The colour coding indicates the parts of the covered byte range:
Compatibility with digital signatures
This ByteRange-based approach works well enough, but there's a snag: any given revision of a PDF file can only have a single "complete" ByteRange! Early drafts of ISO/TS 32004 solved this problem by decreeing that MACs could only be used in unsigned documents.
Since MACs and digital signatures serve very different purposes, that incompatibility didn't sit well with me. Especially since there's an easy, backwards-compatible solution! PDF signatures use CMS `SignedData`, which supports attributes, so we could simply let the MAC token "hitch a ride" on the signature. That way we only need a single ByteRange to make both the signature and the MAC work 5. The structure of the MAC token is otherwise pretty much the same as in the unsigned case.
While signatures in encrypted documents aren't a very common sight, we occasionally come across signed encrypted documents. By allowing MACs in any encrypted document, we can achieve (more or less) the same integrity guarantees for all such documents. This uniformity also benefits validation: a MAC checker with zero signature validation capabilities shouldn't have to make judgment calls about whether documents with a signature (and no MAC) are adequately protected. In addition, the fix was simple enough that sacrificing compatibility wasn't worth it. After some discussion, we decided to put it in the spec.
For an example where the separation of concerns between MACs and signatures is even more clear: a document timestamp signature ordinarily has no authenticating value. Timestamping servers don't care about what they sign, and it's possible to add a timestamp to an encrypted document without knowing the key. In other words, there's no accountability at all. Adding an ISO/TS 32004 MAC token as an (unsigned) attribute on the signature is a way to solve that problem.
Compatibility/security trade-off
In the PDF world, backwards compatibility is a big deal. When new functionality is considered for standardisation, one of the most important criteria involves evaluating how existing software would cope with the change.
This was no different for ISO/TS 32004: a document with a MAC still needs to be understood by software that doesn't know how MACs work. That, in itself, is a good thing.
The converse problem is more tricky, though: how can a MAC-aware processor tell the difference between a "legacy" document without a MAC, and a document from which the MAC has been (maliciously) stripped? 6 Paranoid implementations could perhaps enforce MACs rigorously, but that might not be feasible for everyone.
To address this concern, ISO/TS 32004 defines an extra permission bit to indicate whether a MAC is expected to be present. Since the permission bits are already protected by a "pseudo-MAC" of sorts 7 in PDF 2.0, there's a degree of tamper-resistance built in.
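One way a MAC-aware processor might turn this into a decision procedure is sketched below. Everything here is an assumption for illustration: the function names, the outcome labels, and in particular `HYPOTHETICAL_MAC_BIT`, which is a placeholder. Look up the actual bit position in ISO/TS 32004 before relying on it.

```python
# Placeholder value only; the real bit position is defined by ISO/TS 32004.
HYPOTHETICAL_MAC_BIT = 13


def mac_expected(permissions: int, bit: int = HYPOTHETICAL_MAC_BIT) -> bool:
    """PDF permission bits are numbered from 1, starting at the low-order bit."""
    return bool(permissions & (1 << (bit - 1)))


def decide(mac_present: bool, mac_valid: bool, permissions: int,
           strict: bool = False) -> str:
    if mac_present:
        # A MAC that is there must always validate.
        return "accept" if mac_valid else "reject"
    if mac_expected(permissions):
        # The file claims to carry a MAC, but none was found: likely stripped.
        return "reject"
    # No MAC and none announced: legacy document, unless we're paranoid.
    return "reject" if strict else "accept-legacy"


with_bit = 1 << (HYPOTHETICAL_MAC_BIT - 1)
print(decide(False, False, with_bit))         # MAC announced but missing
print(decide(False, False, 0))                # legacy document
print(decide(False, False, 0, strict=True))   # paranoid mode
```

The `strict` flag corresponds to the "paranoid" implementations mentioned above, which enforce MACs regardless of what the permission bits say.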
Pitfalls and judgment calls
User passwords
PDF's standard security handler distinguishes between "user passwords" and "owner passwords". It's somewhat common for people to apply encryption to a PDF document, but leave the user password empty, while still setting the owner password to something else. This is the digital equivalent of a "No Trespassing" sign on an unguarded fence. Sure, bona fide viewers will enforce permission bits if the owner password is not supplied, but nonetheless, anyone can compute the file encryption key if the user password is left empty.
In other words, from a purely technical perspective, anyone can decrypt and modify the document content. This is no different in a scenario where MACs are used: if the user password is empty, anyone can validate, but also regenerate the MAC. In other words, a MAC offers precisely zero protection in this situation.
Warning
Remember: a MAC that anyone can verify is a MAC that anyone can forge. MACs are based on shared secrets. They're most useful if the relationship between the parties involved in a workflow is symmetric (i.e. everyone has the same access level).
In fact, if I were writing a piece of software that required MACs for everything, I'd actively reject PDF files with empty user passwords.
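Here's a toy demonstration of why an empty user password defeats the MAC. It's a deliberate simplification: the key derivation below is a stand-in (PBKDF2), not the algorithm used by PDF's security handler, and ISO/TS 32004 wraps the MAC key rather than deriving it directly. All names and values are hypothetical. The point carries over, though: if everything needed to reach the key is public, so is the ability to produce valid MACs.

```python
import hashlib
import hmac


def toy_kdf(password: bytes, salt: bytes) -> bytes:
    # Stand-in key derivation, NOT the one from the PDF specification.
    return hashlib.pbkdf2_hmac("sha256", password, salt, 10_000)


def make_tag(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()


salt = b"stored-in-the-file"        # readable by anyone who has the file

# The author protects the document, but leaves the user password empty.
author_key = toy_kdf(b"", salt)
original_tag = make_tag(author_key, b"Total due: 100 EUR")

# The attacker needs no secret at all to arrive at the same key...
attacker_key = toy_kdf(b"", salt)
forged_content = b"Total due: 999 EUR"
forged_tag = make_tag(attacker_key, forged_content)

# ...so the victim's validation of the forged document succeeds.
victim_key = toy_kdf(b"", salt)
forgery_accepted = hmac.compare_digest(
    make_tag(victim_key, forged_content), forged_tag
)
print(forgery_accepted)


def acceptable_for_strict_mac_policy(user_password: bytes) -> bool:
    """Hypothetical policy check: refuse files that open with an empty password."""
    return len(user_password) > 0
```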
Coverage checks
A MAC should cover the entire document to which it is applied (other than the MAC container itself). As with digital signatures, coverage is indicated by the associated ByteRange. When receiving a document with a MAC, it's very important to check the ByteRange: if the covered region is too small, unauthorised changes could still lurk in the "unprotected" regions. Processors incrementally updating a document are also expected to update the MAC (including the coverage range).
This presents a problem: what to do when one receives a document with a "stale" MAC (i.e. a MAC that doesn't cover the full document anymore). That could be the result of a malicious edit, but also due to an authorised change by a tool that doesn't implement ISO/TS 32004.
Again, this one boils down to choosing compatibility vs. security, and there's no easy answer. Personally, I would write my implementation to reject such documents by default.
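A coverage check along these lines could look like the sketch below. It assumes a four-element ByteRange, as used for PDF signatures, and only checks the arithmetic. A real validator would additionally have to confirm that the hole coincides exactly with the MAC container. The function name is made up.

```python
def covers_whole_file(byte_range: list[int], file_size: int) -> bool:
    """Check that a ByteRange spans the file, leaving exactly one hole."""
    if len(byte_range) != 4 or any(value < 0 for value in byte_range):
        return False
    start1, len1, start2, len2 = byte_range
    return (
        start1 == 0                       # coverage starts at the first byte
        and len1 > 0
        and start2 > len1                 # the hole sits between the segments
        and start2 + len2 == file_size    # coverage runs up to the last byte
    )


print(covers_whole_file([0, 10, 20, 30], 50))   # fully covered
print(covers_whole_file([0, 10, 20, 20], 50))   # stale: 10 bytes appended later
print(covers_whole_file([5, 5, 20, 30], 50))    # does not start at offset 0
```

The second case is what a "stale" MAC looks like after an incremental update: the declared coverage stops short of the end of the file.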
On by default: yes or no?
Cryptographically speaking, adding MACs to all 8 encrypted PDF documents seems like the obvious thing to do. Ultimately, the security of the MAC process is protected by the same shared secret as the document encryption, so there's no need to involve external keying material (as would've been the case with signatures). In particular, the user doesn't necessarily need to do anything special to benefit from MACs.
That said, MACs do come with an I/O performance cost. As with signatures, the PDF data needs to be serialised to a place where the MAC can be inserted in place later.
Depending on the application architecture, that cost might be completely negligible, or prohibitive 9. I'd still recommend turning it on by default if you can — especially when updating files that already have MACs!
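If memory is the bottleneck, the covered byte range can be fed to the MAC function in chunks instead of loading the whole file. The sketch below illustrates that idea with HMAC-SHA256 over a seekable file object. It's a simplification, since the real ISO/TS 32004 token is a CMS structure rather than a bare HMAC over the bytes, and the function name and chunk size are arbitrary choices.

```python
import hashlib
import hmac
import io
from typing import BinaryIO


def stream_mac(fileobj: BinaryIO, byte_range: list[int], key: bytes,
               chunk_size: int = 64 * 1024) -> bytes:
    """Compute an HMAC over the ByteRange segments without buffering the file."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        fileobj.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = fileobj.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError("ByteRange extends past the end of the file")
            mac.update(chunk)
            remaining -= len(chunk)
    return mac.digest()


# Small in-memory stand-in for a large file on disk
data = bytes(range(256)) * 40
byte_range = [0, 1000, 1100, len(data) - 1100]
key = b"k" * 32

streamed = stream_mac(io.BytesIO(data), byte_range, key, chunk_size=97)
one_shot = hmac.new(key, data[:1000] + data[1100:], hashlib.sha256).digest()
print(streamed == one_shot)
```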
Conclusion (tl;dr)
ISO/TS 32004 provides you with a MAC-based tool to protect your encrypted PDF files from malicious tampering. The MAC is bootstrapped from the same shared secret as the encryption uses.
Additionally, ISO/TS 32004 is fully backwards compatible and can be used in conjunction with all PDF 2.0 features, including digital signatures.
Things to keep in mind:
- Think carefully about whether using MACs and/or digital signatures makes sense for your workflow. They're very different things.
- Backwards compatibility comes with some security trade-offs. Keep those trade-offs in mind.
- Validation for MACs is a lot more straightforward than for signatures.
Oh, and if you're keen to implement ISO/TS 32004 yourself once it's out, please give Annex B a proper read for some extra info on what to look out for when validating MACs in PDF documents.
Footnotes
1: The technology and standards to include this kind of integrity protection have also been around for a long time. For example, the research around HMAC dates from the mid-90s, and the IETF first standardised it in '97. PDF 2.0, which included a major restructuring of the file encryption functionality in PDF, first saw the light of day two decades later.
2: The MAC key and the encryption key are usually derived from a common piece of secret data that is shared between the communicating parties.
3: AES-OCB is a notable exception; here, the computation of the authentication tag is tightly integrated with the encryption process.
4: To be pedantic: the MAC key in itself actually isn't derived from the password or the file encryption key in ISO/TS 32004. Rather, it's encrypted using a key derived from the file encryption key, following a common pattern used with CMS `AuthenticatedData` and `EncryptedData`.
5: In a signed revision of a PDF document, the MAC token is actually computed over the digest of the byte range together with a digest of the signature. As such, it protects both the signature and the document content.
6: The analogous problem for incremental updates by non-MAC-aware processors is left as an exercise to the reader. Alternatively, if you're not in the mood, feel free to peek ahead.
7: The "pseudo-MAC" being a separate entry computed by encrypting the permission bits using AES-ECB (!). Perhaps ironically, this piece of known AES plaintext was a large part of why we needed ISO/TS 32004 in the first place...
8: Strictly speaking, all encrypted PDF 2.0 documents, since ISO/TS 32004 is an extension of ISO 32000-2:2020.
9: There are ways around that, though. In unsigned revisions, ISO/TS 32004 forces the MAC container to reside in (a subdictionary of) the document trailer, which means that it's somewhere near the end of the file. That can be leveraged to generate MACs very efficiently, even on large documents that don't fit into memory. Retrofitting that onto a legacy codebase is of course easier said than done.
A version of this article was originally posted at https://mvalvekens.be/blog/2022/about-iso32004.html