Files inside PDF
Accurately understanding the kinds of data that can be in a PDF and how it is stored can be very important. This article aims to explain the different ways by which PDF stores “files”. How do “files” end up inside PDF files? There are many ways…
BUSINESS NOTE
Users will better understand, navigate, and leverage the many possibilities of PDF’s embedded files features if the terminology they encounter around “files” is clear and precise.
Dragging and dropping a photo, inserting a clipart image from a cloud library, using a context menu to “convert file to PDF”, installing a fancy new font, or embedding a spreadsheet into a word processor file: these are just some of the ways we use “files” that eventually place them inside a PDF file.
In many cases, these files do not end up as visible “file attachments” in your typical PDF viewer and this can cause confusion - where did my file go?
What is meant by a “file” with PDF depends on your viewpoint, and may include notions and expectations derived from how HTML and other formats may do things, some of which differ from the metaphors used by PDF software.
End users are sometimes perplexed as to why they cannot simply “download” the JPEG file of a photo in a PDF, or extract a font from PDF viewing software, yet they might be able to extract the embedded spreadsheet they’d attached. If your role relates to digital preservation or file forensics then accurately understanding the kinds of data that can be in a PDF and how it is stored can be very important.
This article aims to explain the different ways by which PDF stores “files”.
Metaphors
The PDF specification, ISO 32000, establishes processing requirements for page rendering, but does not constrain how PDF viewing software might present other features to users. Software is free to choose terminology and user interface metaphors most appropriate to its users.
Diverse interfaces foster various end-user understandings of what “files” exist inside any given PDF. This is especially true when we consider the many similar terms in use. Understandings differ or are ambiguous for end users depending on their background, and the terms themselves may have different technical meanings. Some of these terms used to describe “files” in the PDF context include:
- File
- Embedded file
- File attachment
- Image file
- Asset
- Collection
- Portfolio
- Package
- Related file
- Associated file
- Stream
- File specification
This variety of terms can be very confusing, and can lead to misunderstandings. Ideally, vendors should provide clear documentation on their use of terminology.
Files are streams, but streams are not always files
PDF is an object-based page description language. For this discussion, the two primary types of objects related to “files inside PDF” are stream objects and file specification dictionaries.
One of PDF’s nine basic object types, PDF stream objects (ISO 32000-2:2020, subclause 7.3.8) define a sequence of bytes stored in the PDF. Streams can contain any type of data, are often compressed, and can be encrypted. They are used for content streams (drawing commands), images, fonts, ICC color profiles, metadata, and more, but not all these formats are necessarily valid independent external files in addition to their role(s) in PDF.
A PDF File Specification dictionary (ISO 32000-2:2020, 7.11) is a PDF dictionary object that defines the details of what can be thought of as a conventional file, including a file name, and information about where the files’ data resides, either embedded inside the PDF file as a separate embedded file stream object, or externally on the local file system or via a URL.
Accordingly, all embedded files in a PDF are stored as streams, but not all streams are necessarily files.
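The distinction can be made concrete with a minimal sketch that serializes three objects in PDF syntax. The helper names, object numbers, and the file name data.csv are assumptions chosen for illustration, and the fragment is not a complete PDF file (there is no header, cross-reference table, or trailer).

```python
import zlib

def make_stream_object(obj_num: int, payload: bytes, extra_entries: str = "") -> bytes:
    """Serialize payload as a Flate-compressed PDF stream object (hypothetical helper)."""
    compressed = zlib.compress(payload)
    header = (
        f"{obj_num} 0 obj\n"
        f"<< /Length {len(compressed)} /Filter /FlateDecode {extra_entries}>>\n"
        "stream\n"
    ).encode("ascii")
    return header + compressed + b"\nendstream\nendobj\n"

def read_stream_payload(obj_bytes: bytes) -> bytes:
    """Recover the payload from an object built by make_stream_object.

    A real parser would rely on the /Length entry; searching for the
    keywords is only good enough for this sketch.
    """
    start = obj_bytes.index(b"stream\n") + len(b"stream\n")
    end = obj_bytes.rindex(b"\nendstream")
    return zlib.decompress(obj_bytes[start:end])

# An embedded file stream: the stream dictionary carries /Type /EmbeddedFile.
embedded = make_stream_object(
    10, b"a,b\n1,2\n", "/Type /EmbeddedFile /Subtype /text#2Fcsv "
)

# A file specification dictionary naming the file and pointing at object 10.
filespec = (
    b"11 0 obj\n"
    b"<< /Type /Filespec /F (data.csv) /UF (data.csv) /EF << /F 10 0 R >> >>\n"
    b"endobj\n"
)

# A content stream is also a stream, but nothing designates it as a file.
content = make_stream_object(12, b"BT /F1 12 Tf (Hello) Tj ET")
```

Objects 10 and 12 use exactly the same stream mechanism. Only the dictionary entries on object 10, together with the file specification dictionary in object 11, give the first one the identity of a “file”.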
My image is no longer a file?
A confusing situation for many end users is when an image file (e.g. a JPEG photo) is converted to a PDF, such as via a “Convert to PDF” command. With HTML, for example, image files are simply referenced via their URL, but in PDF images are commonly represented by Image XObjects, which are special kinds of data streams containing the bitmap data together with the image width, height, bit depth, color information, etc. They are no longer “files” from a PDF viewpoint. To paint the image onto the PDF page, Image XObjects are referenced by a Do operator using the current transformation matrix, allowing Image XObjects to be positioned, reused, scaled, skewed, etc.
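As a rough illustration of how the current transformation matrix places an image, the sketch below maps the unit square that an image occupies in image space onto the page. The matrix values and the resource name /Im1 are hypothetical; the arithmetic follows PDF's six-number matrix form [a b c d e f].

```python
from typing import List, Tuple

Matrix = Tuple[float, float, float, float, float, float]  # PDF order: a b c d e f
Point = Tuple[float, float]

def apply_ctm(ctm: Matrix, x: float, y: float) -> Point:
    """Transform a point: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)

def image_corners(ctm: Matrix) -> List[Point]:
    """Page-space corners of an image, which always fills the unit square."""
    unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    return [apply_ctm(ctm, x, y) for x, y in unit_square]

# Corresponds to a content stream fragment such as:
#   q 200 0 0 185 72 500 cm /Im1 Do Q
placement: Matrix = (200.0, 0.0, 0.0, 185.0, 72.0, 500.0)
corners = image_corners(placement)
```

With these assumed values the image is drawn 200 units wide and 185 units high with its lower-left corner at (72, 500), regardless of how many pixels the Image XObject actually stores. The same XObject could be painted again elsewhere with a different matrix.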
Users should not expect PDF conversion software to use the source image file “as is”. Although PDF natively supports JPEG images (via the DCTDecode filter), as well as JPEG 2000, JBIG2, and CCITT G4 images, PDF conversion software is free to change the image, scale the image’s pixels before storing them in the PDF, and/or recompress the image, possibly in a more lossy manner. This approach allows PDF creation software to optimize for file size and other needs. In HTML, by contrast, image files are simply referenced directly and used via their URL:
<img src="photo.jpeg" alt="My cat" width="200" height="185" />
PDF Image XObjects do not appear in the embedded files or file attachment list of most PDF applications, but they may still be extractable to a file via other means (e.g. a context menu in PDF software “Save to file”). Whether or not such extracted images are identical to the image data in the PDF file is again up to each implementation, but it should not be assumed that a JPEG photo converted to PDF and then exported again is faithfully preserved. The use of imprecise terminology such as “Image Files” in forensic software can also be very confusing!
File attachments
In a PDF context, “file attachments” refer to a PDF feature - “file attachment annotations” - that was introduced in PDF 1.3 back in 2000 and is defined in ISO 32000-2:2020, 12.5.6.15. File attachment annotations are the long-standing PDF feature most commonly associated with “embedding files” in PDF, and are always associated with PDF pages. On a page, they are typically represented by a paperclip icon, although PDF also defines other icon options (Graph, PushPin, Tag). A single file attachment annotation can also be referenced from multiple pages.
Application software (e.g., email clients, office suite applications) commonly uses PDF file attachment annotations to represent email attachments or other forms of traditionally embedded files that get dragged and dropped into office suite applications (for example, adding a spreadsheet to a word processing document).
Embedded Files
The root of the PDF document object model is the Document Catalog dictionary. As shown in Figure 5 - Structure of a PDF document (from ISO 32000-2, reproduced below), the Document Catalog includes an Embedded File list under the Names dictionary:
The Names dictionary contains several PDF name trees - a complex set of PDF objects that together provide a type of index linking PDF string objects and other kinds of objects. In particular, the EmbeddedFiles name tree (introduced with PDF 1.4) maps filenames (strings) to PDF file specification dictionary objects for certain embedded files. This data structure does not support the concept of folders or subdirectories and files are not required to be uniquely named.
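A simplified model of walking such a name tree is sketched below, using plain Python dictionaries in place of PDF objects. The node layout (Kids, Names, Limits) mirrors the entries a name tree uses, but the keys, file names, and object references are invented for illustration. In this sketch the tree keys are distinct while two file specifications share the same file name.

```python
from typing import Any, Dict, Iterator, List, Tuple

def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[str, Any]]:
    """Yield (key, value) pairs from a name tree node and all its descendants."""
    names: List[Any] = node.get("Names", [])
    # A Names array alternates key, value, key, value, ...
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)

# Hypothetical EmbeddedFiles name tree with two leaf nodes.
embedded_files_tree = {
    "Kids": [
        {
            "Limits": ["attachment-1", "attachment-2"],
            "Names": [
                "attachment-1", {"F": "budget.xlsx", "EF": "10 0 R"},
                "attachment-2", {"F": "notes.txt", "EF": "12 0 R"},
            ],
        },
        {
            "Limits": ["attachment-3", "attachment-3"],
            "Names": [
                "attachment-3", {"F": "notes.txt", "EF": "14 0 R"},
            ],
        },
    ]
}

entries = list(iter_name_tree(embedded_files_tree))
filenames = [filespec["F"] for _, filespec in entries]
```

The result is a flat list with no folder structure, and notes.txt appears twice because it is backed by two different embedded file streams.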
In a PDF context, the term “embedded files” usually refers to those embedded file streams listed in the EmbeddedFiles name tree in the Document Catalog. However, the specification neither requires nor prohibits listing every embedded file stream in this index (e.g. File Attachment annotations, 3D and rich media assets). Accordingly, PDF creation software may differ in which embedded file streams it chooses to add to the EmbeddedFiles name tree, based on assumptions about PDF viewing applications or end-user interaction.
PDF software may also analyze a PDF to locate other embedded file streams not listed in the EmbeddedFiles name tree (such as file attachment annotations on pages), and/or to exclude embedded file streams that may have an alternate way of being presented (e.g. 3D or rich media assets), or are duplicated. PDF also supports groups of files, known as Related Files.
PDF Portable Collections
PDF Collections (subclause 12.3.5, ISO 32000-2:2020) were introduced in PDF 1.7 and define “how an interactive PDF processor’s user interface presents collections of file attachments, where the attachments are related in structure or content. Such a presentation is called a portable collection.”
Note that this excerpt from ISO 32000-2 confuses matters with the term “file attachment”. Collections are a completely independent feature from file attachment annotations, and both features may occur in a single PDF!
PDF portable collections (sometimes referred to as “PDF portfolios” or “PDF packages”) can be simplistically thought of as ZIP files with a user interface. Authors can opt to specify a specific navigation interface to present their collection of files, such as a detailed file list, a tile or filmstrip mode, a carousel mode, or to present the pages of the portable collection PDF itself. Collections also allow for organizing the files in the collection into hierarchical “folders” (in PDF, the folder hierarchy is defined using PDF data structures; there’s no internal “file system”).
PDF portable collections also allow the embedded files in the collection to have a custom sort order and custom data fields that the author can define. The example given in the PDF specification is for embedded email files where the schema has email-related fields such as “To”, “From”, “Subject”, etc. This additional customizable schema is unique to PDF collections.
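The idea of author-defined fields and sort order can be modeled loosely as follows. This is only a sketch: the dictionaries stand in for file specifications carrying collection item data (shown here under a "CI" key), and the field names, file names, and addresses are made up for the example.

```python
from typing import Any, Dict, List

def sort_collection(
    files: List[Dict[str, Any]], sort_fields: List[str], ascending: bool = True
) -> List[Dict[str, Any]]:
    """Order collection files by the given schema fields, first field first."""
    def sort_key(file_entry: Dict[str, Any]) -> tuple:
        item = file_entry.get("CI", {})
        return tuple(str(item.get(field, "")) for field in sort_fields)
    return sorted(files, key=sort_key, reverse=not ascending)

# Hypothetical email collection with a "From"/"Subject" schema.
emails = [
    {"F": "msg1.eml", "CI": {"From": "carol@example.com", "Subject": "Budget"}},
    {"F": "msg2.eml", "CI": {"From": "alice@example.com", "Subject": "Agenda"}},
    {"F": "msg3.eml", "CI": {"From": "alice@example.com", "Subject": "Minutes"}},
]

ordered = [f["F"] for f in sort_collection(emails, ["From", "Subject"])]
```

A viewer that does not support collections would ignore the schema and sort order entirely and show only a flat list of file names.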
PDF collections require that all files in the collection be explicitly listed in the EmbeddedFiles name tree, and thus will appear in the respective panel of many PDF applications. This provides the necessary backward compatibility for applications that do not directly support PDF Collections.
Of course, a PDF Collection is not just a collection, but is also a PDF file in its own right. As such, it is not restricted and may include - in addition to and distinct from the collection - file attachment annotations, Associated Files, 3D or rich media assets, and Image XObjects as other types of “embedded files”. The specification leaves PDF application developers free to decide how these different kinds of embedded files are displayed.
Multimedia and 3D assets
Movies, audio, 3D, and other forms of rich media assets can all occur within PDF documents. In PDF, these media are based on different types of annotation objects. Depending on the annotation’s Subtype entry, the annotation will use either streams or file specification dictionaries to store the media files. Many such assets would be valid file types if extracted as files, but PDF viewing software is free to decide whether these assets should be presented using file-like metaphors or not.
Unlike the older movie and sound annotations originally defined in PDF 1.2, the newer 3D and RichMedia annotations require that their assets also be explicitly listed in the EmbeddedFiles name tree, and thus will appear in the respective panel of many PDF applications.
Associated Files
PDF/A-3 (PDF 1.7) and later PDF 2.0 introduced the ability to relate embedded content with specific PDF objects in a new feature known as Associated Files (see clause 14.13 in ISO 32000-2):
“Associated files provide a means to associate content in other formats with objects of a PDF file and to identify the relationship between them. Such associated files are designated using file specification dictionaries [...], and AF keys are used in object dictionaries to connect the associated file’s specification dictionaries with those objects.”
Associated Files can be associated with any PDF object, along with an optional basic semantic relationship such as Source, Schema, FormData, or C2PA_Manifest (for C2PA support). See the PDF Association’s “PDF 2.0 Application Note 002: Associated Files” for further recommendations. For backward compatibility reasons, Associated Files are always required to be included in the EmbeddedFiles name tree. They should thus appear in most PDF applications’ list of embedded files, even if the relationship is undefined or the associated PDF object cannot be indicated.
Metadata and other “file” streams
PDF files may contain many other kinds of “file-like” stream objects including multiple XMP metadata streams, ICC color profiles, fonts, CMaps, JavaScript, PostScript, etc. Although many of these formats can be extracted and stored as independent external files (with appropriate naming conventions), general-purpose PDF software may not expose such capabilities directly - such functionality is more likely found in specialized PDF applications for specific data types, or more generally, in PDF forensic analysis tools.
PDF also supports a feature called “incremental updates” in which document revisions (edits) are stored as deltas appended to the end of a file. In this way, new objects may be added or existing objects may be marked as deleted while still remaining in the file. All PDF software must necessarily understand incremental updates, and the view typically seen in PDF software is that of the final document - the original plus all revisions. Thus, for example, an incremental update may add a file to a PDF and a subsequent incremental update may then mark that same file as deleted. If incremental updates are used, a standard PDF viewer might show that no file is present even though the object representing that file may still exist in the PDF! This knowledge is highly relevant to forensic and redaction workflows and is why tooling for those domains may show objects (including files) that other PDF software does not.
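One common heuristic for spotting incremental updates is to look for multiple %%EOF markers, since each appended revision ends with its own. The sketch below applies that heuristic to a synthetic byte string; the sample bytes are stand-ins rather than a valid PDF (the offsets are not real), and the function name is an assumption. The heuristic is approximate: linearized files and stream data that happens to contain the marker can also produce extra matches.

```python
import re
from typing import List

EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")

def split_revisions(pdf_bytes: bytes) -> List[bytes]:
    """Return the cumulative file prefix ending at each %%EOF marker.

    Each prefix approximates the file as it existed after that revision.
    """
    return [pdf_bytes[: match.end()] for match in EOF_MARKER.finditer(pdf_bytes)]

# Synthetic stand-ins for an original file and two appended updates.
original = b"%PDF-1.7\n% original objects would be here\nstartxref\n0\n%%EOF\n"
add_file = (
    b"10 0 obj\n<< /Type /Filespec /F (secret.csv) >>\nendobj\n"
    b"startxref\n0\n%%EOF\n"
)
delete_file = b"xref\n10 1\n0000000000 00001 f \nstartxref\n0\n%%EOF\n"

revisions = split_revisions(original + add_file + delete_file)
```

Here the final revision marks object 10 as free, yet the bytes naming secret.csv are still present in the file, which is what forensic tools can surface.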
Disambiguating “files”
With so many uses of the term “files” in PDF and in application user interfaces, its usage can be confusing. Although there is some flexibility for PDF developers, there are some distinguishing factors:
- PDF file attachment annotations are always associated with one or more pages and are typically visually represented by paperclip icons on pages;
- PDF portable collections define a set of embedded files in the collection, optionally with an author-defined navigation experience, and optionally, with folder hierarchies (a feature that in PDF is unique to collections), sort order, and custom schema fields;
- An Associated File is an embedded file associated with a specific PDF object, which may optionally include a defined semantic relationship (AFRelationship key);
- Image XObjects represent images such as photos, but pixel data may be changed when initially converted to, or extracted from, PDF files;
- 3D and multimedia use page-based annotations, along with embedded files for the media assets;
- The EmbeddedFiles name tree in the document catalog must list all embedded files that are part of a PDF collection or are an Associated File, but may also contain other embedded files as decided by the author or authoring application. This list may contain many files and files of the same name;
- Other stream objects in a PDF, whether embedded file stream objects or not, may also be made available in specialized software to extract as a file or be referred to as a “file” even though the PDF specifications do not refer to it as a file or even use the PDF embedded file features.
PDF software has a lot of freedom in how it presents “files” to users and lets users interact with them. For example, it may present only the files listed in the EmbeddedFiles name tree. Applications may solely rely on the information in the PDF file or they may use other methods to associate files with locally installed applications. Modern PDF software should also indicate Associated File relationships (if any), as well as any embedded file stream dictionary parameter values. For performance reasons with long PDF documents, file attachment annotations associated with pages may only be presented once those pages are viewed. These experiences rely on dictionary entries in the File Specification dictionary and embedded file stream, but may be supplemented by additional functionality in the PDF application (such as detecting the file type or associating with installed applications). Modern PDF applications usually include cyber-security measures regarding embedded files, such as hiding or prohibiting certain types of potentially malicious files (e.g. executable files).
PDF/A-2, PDF/A-3 and PDF/A-4 conformant software must also implement basic presentation of embedded files, even if file extraction is only a recommendation (quoting ISO 19005-3:2012, clause 6.8):
A conforming interactive reader shall provide a mechanism to display the name strings from the value of the EmbeddedFiles key in the names dictionary of a conforming file. In addition, a conforming interactive reader may also choose to display information from the associated embedded file stream dictionaries or their Params dictionary.
Although embedded files that do not comply with any part of this [PDF/A-3] should not be rendered by a conforming reader, a conforming interactive reader should enable the extraction of any embedded file. The conforming interactive reader should also require an explicit user action to initiate the process.
How developers can help users to understand
With such a variety of “files” inside PDFs, software applications must assist users in understanding the context of each file. Avoiding duplicate files based on their PDF object number, understanding that multiple files in a PDF may have the same filename, taking care with terminology, and providing contextual information such as pages (indicative of annotations), associated objects (indicative of associated files), semantic relationships (AFRelationship key, as we have previously discussed), folders (a unique PDF portable collections feature), or indicating if a file is part of a portable collection can go a long way in helping users understand embedded “files” in PDFs.
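A minimal sketch of that idea follows: references to embedded files are grouped by the object number of the underlying stream, so one physical file is listed once with all of its contexts, while same-named files backed by different objects stay separate. The FileReference type, the context strings, and the sample data are hypothetical and not taken from any particular PDF library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class FileReference:
    object_number: int  # object number of the embedded file stream
    filename: str       # name taken from the file specification dictionary
    context: str        # where the reference was found in the PDF

def group_by_object(refs: List[FileReference]) -> Dict[int, List[FileReference]]:
    """Group references that point at the same embedded file stream."""
    grouped: Dict[int, List[FileReference]] = defaultdict(list)
    for ref in refs:
        grouped[ref.object_number].append(ref)
    return dict(grouped)

def describe(refs: List[FileReference]) -> List[str]:
    """Produce one display line per distinct embedded file stream."""
    lines = []
    for obj_num, group in sorted(group_by_object(refs).items()):
        contexts = "; ".join(ref.context for ref in group)
        lines.append(f"{group[0].filename} (object {obj_num}): {contexts}")
    return lines

refs = [
    FileReference(10, "report.xlsx", "EmbeddedFiles name tree"),
    FileReference(10, "report.xlsx", "file attachment annotation on page 2"),
    FileReference(14, "report.xlsx", "Associated File (AFRelationship: Source)"),
]
summary = describe(refs)
```

In this example the user would see two entries named report.xlsx, each annotated with enough context to tell them apart, rather than three undifferentiated rows.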
Conclusion
The core ISO standard for PDF (ISO 32000-2:2020) is primarily a file format specification and does not mandate user interface terminology, user interface metaphors, how users interact with “files”, or even whether file extraction is a required feature. These are implementation decisions left to developers to best address the specific needs of their users and target industries.
Although PDF content can be created from “files”, this does not mean that a source file can always be recovered from a PDF even in the simplest of cases. Although the PDF file format supports many concepts that can be thought of as “files”, PDF’s flexibility can confuse users due to differing terminology, user interfaces, metaphors, and expectations derived from other formats.