XMP – Master of All Data

Author: Carsten Luedtge

In document processing, nothing works without meaningful metadata, at least not if you want to automate it. The XMP format is key.

Member News, September 10, 2014

DISCLAIMER
The views expressed in this article are those of the author(s) and do not reflect the policies or positions of the PDF Association.

The digitization of document processing is proceeding apace. Companies are sending, receiving and processing more and more invoices, damage reports, contracts, customer confirmations and other correspondence electronically. As a result, input and output management are beginning to converge. Automating as many processes as possible is especially important in this endeavor. Metadata, or the information about the document (type, creation date, sender, reference to other processes), drives those processes.

Metadata is hardly new, having become a fairly familiar element in the workaday world. In a typical input management scenario, documents are scanned, converted to text via optical character recognition (OCR), and then usually stored in an archive or document management system (DMS). At the same time, the metadata must be identified via barcodes, rules, and heuristics and then stored, so that the documents can be correctly assigned and categorized and retrieved at any time. A lot has to be checked manually afterwards, which is naturally quite costly.

Paper versus XML

Today there are two basic variants of document exchange: traditional paper or strictly electronic in an XML or EDIFACT format, which is not printable per se. Current developments such as ZUGFeRD combine the two extremes in a single format, which ultimately is best. But if individual PDF files are generated as virtual by-products from the traditional print data stream and sent to the recipient, a gap emerges. In spite of electronic transfer, there is usually too little metadata to automatically process the document. So how do we further automate non-standardized document exchange?

XMP: Bridge between physical and electronic exchange

This is where XMP enters the picture (see dossier). XMP defines how metadata is embedded not only in PDF documents -- surely the most frequent use case -- but also in PostScript, JPG, PNG, TIFF, HTML and AFP files. The metadata itself travels in an XMP packet, which is a block of XML. XMP packets have a major advantage: they contain a unique marker and are, where possible, always saved in plain text, so that even an application that fails to understand the specific data format can still extract the XMP metadata. Because the packet is plain text, caution should prevail where confidential information is concerned.
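The marker-based extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete reader: it assumes the packet is stored uncompressed and UTF-8 encoded, and the function name `extract_xmp_packets` and the sample bytes are made up for this example.

```python
import re

# XMP packets are wrapped in <?xpacket begin=...?> and <?xpacket end=...?>
# processing instructions. The begin marker may carry a byte-order mark
# between its quotes, so the pattern accepts any bytes up to the closing "?>".
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def extract_xmp_packets(data: bytes) -> list[str]:
    """Return the body of every UTF-8 XMP packet found in raw file bytes."""
    return [
        match.group(1).decode("utf-8", errors="replace").strip()
        for match in PACKET_RE.finditer(data)
    ]


# Hypothetical file content: opaque binary data surrounding one packet.
sample = (
    b"\x00\x01binary-prefix"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    b'<?xpacket end="w"?>'
    b"binary-suffix\xff"
)

packets = extract_xmp_packets(sample)
print(packets[0])
```

The scan never interprets the surrounding file format, which is exactly why it also works on files the application does not otherwise understand. Packets stored in compressed streams or in UTF-16 would need additional handling.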

[Figure: example of an XMP packet]

XMP defines a set of core properties that can be used universally (title, creator, topic, date, unique identifier, language, description). To describe metadata, XMP draws on existing vocabularies and standards (ontologies) such as Dublin Core, IPTC, and Exif (see dossier). XMP also allows the definition of custom attributes such as customer and policy number, document validity, invoice due date, and name/version of a document template.
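As a rough sketch of how core and custom properties can sit side by side, the following Python snippet builds a small packet with the standard library. The `ins` namespace, its URI, and the property names `CustomerNumber` and `PolicyNumber` are hypothetical and chosen only for illustration. Only simple text properties are used; language-dependent properties such as the title need a nested structure that is omitted here.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    # Hypothetical custom namespace for business attributes (an assumption).
    "ins": "http://example.com/ns/insurance/1.0/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix: str, name: str) -> str:
    """Build the {namespace}name form that ElementTree expects."""
    return f"{{{NS[prefix]}}}{name}"


def build_packet(identifier: str, create_date: str,
                 customer_no: str, policy_no: str) -> str:
    """Return an XMP packet holding two core and two custom properties."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    # Core properties from Dublin Core and the XMP basic schema.
    ET.SubElement(desc, q("dc", "identifier")).text = identifier
    ET.SubElement(desc, q("xmp", "CreateDate")).text = create_date
    # Custom properties in the hypothetical namespace.
    ET.SubElement(desc, q("ins", "CustomerNumber")).text = customer_no
    ET.SubElement(desc, q("ins", "PolicyNumber")).text = policy_no
    body = ET.tostring(xmpmeta, encoding="unicode")
    # Wrap the XML in the packet markers that make it findable by scanning.
    return ('<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
            + body + '\n<?xpacket end="w"?>')


packet = build_packet("urn:uuid:0000-example", "2014-09-10", "C-1001", "P-77")
print(packet)
```

A recipient can parse the same packet with any XML parser and look up the custom values by namespace, which is what lets downstream processing run without manual checks.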

XMP awareness is still lacking

Currently, XMP is used primarily in PDF/A. The ISO standard itself makes recommendations for defining and saving metadata. It mandates the XMP packet for PDF/A documents, for example, and recommends using a unique identifier. It also recommends carrying document origin information through the entire process, especially when conversions are done. In PDF/A files, all custom properties must be described via an embedded schema. That could be a reason why XMP and the extensive use of metadata are still being neglected. Yet this step is far less complicated than it appears.
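To give a feel for how small such an embedded schema description can be, here is a hedged Python sketch that generates one. It assumes the namespace URIs commonly used for PDF/A extension schemas (`pdfaExtension`, `pdfaSchema`, `pdfaProperty`). The schema name, the `ins` prefix, and the two properties are hypothetical. Any real output should be checked against the PDF/A standard and a validator.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace URIs assumed for PDF/A extension schema descriptions.
EXT = "http://www.aiim.org/pdfa/ns/extension/"
SCHEMA = "http://www.aiim.org/pdfa/ns/schema#"
PROP = "http://www.aiim.org/pdfa/ns/property#"

for prefix, uri in [("rdf", RDF), ("pdfaExtension", EXT),
                    ("pdfaSchema", SCHEMA), ("pdfaProperty", PROP)]:
    ET.register_namespace(prefix, uri)


def resource_item(parent: ET.Element) -> ET.Element:
    """Append an rdf:li that holds a structure (rdf:parseType="Resource")."""
    return ET.SubElement(parent, f"{{{RDF}}}li",
                         {f"{{{RDF}}}parseType": "Resource"})


def describe_schema(name, namespace_uri, prefix, properties):
    """Build an rdf:Description declaring one custom schema.

    properties is a list of (name, value_type, category, description) tuples.
    """
    desc = ET.Element(f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    schemas = ET.SubElement(desc, f"{{{EXT}}}schemas")
    schema = resource_item(ET.SubElement(schemas, f"{{{RDF}}}Bag"))
    ET.SubElement(schema, f"{{{SCHEMA}}}schema").text = name
    ET.SubElement(schema, f"{{{SCHEMA}}}namespaceURI").text = namespace_uri
    ET.SubElement(schema, f"{{{SCHEMA}}}prefix").text = prefix
    prop_holder = ET.SubElement(schema, f"{{{SCHEMA}}}property")
    prop_seq = ET.SubElement(prop_holder, f"{{{RDF}}}Seq")
    for prop_name, value_type, category, text in properties:
        item = resource_item(prop_seq)
        ET.SubElement(item, f"{{{PROP}}}name").text = prop_name
        ET.SubElement(item, f"{{{PROP}}}valueType").text = value_type
        ET.SubElement(item, f"{{{PROP}}}category").text = category
        ET.SubElement(item, f"{{{PROP}}}description").text = text
    return desc


# Hypothetical custom schema with two text properties.
schema_desc = describe_schema(
    "Insurance document schema",
    "http://example.com/ns/insurance/1.0/",
    "ins",
    [("CustomerNumber", "Text", "external", "Customer number"),
     ("PolicyNumber", "Text", "external", "Policy number")],
)
schema_xml = ET.tostring(schema_desc, encoding="unicode")
print(schema_xml)
```

The description is written once per custom namespace and can be reused for every document that carries the same properties, so the effort is largely a one-time template task.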

The topic is hardly new. After all, it's not as if we only just started thinking about how to automate the steps of document processing. What is new is the buzz being generated by the progressive convergence of input and output processes due to electronic exchange. Output management used to focus on producing and sending one's own documents efficiently and reliably. Now we're also forced to consider, on the output side, what data will make processing the document easier for the recipient. One important thing to remember is that every media discontinuity results in a loss of data that has to be tediously restored downstream.

Data quality is critical

Using the right data saves time and cost -- perhaps not primarily for output management, but certainly for archiving and, of course, the recipient. Taking the long view, accurate metadata raises general awareness, ultimately benefitting one's own input management. The fact is that a minimum set of meaningful information can considerably simplify electronic processing. Against this backdrop, XMP is certainly an important step on the path to full automation on both the input and output sides. Of course, metadata can still be stored in documents without XMP, but its scope and processing quality are limited. In any case, there is currently no good alternative for PDF documents.

Dossier

XMP (Extensible Metadata Platform)

Standard for embedding metadata in digital files
Published by Adobe in 2001 and first integrated in Acrobat 5
February 2012: Publication of core XMP specification as ISO standard 16684-1

XMP is based on open standards and embeds metadata, expressed in the RDF (Resource Description Framework) format published by the World Wide Web Consortium, in binary files. Metadata is integrated in different applications according to a uniform schema, thus allowing other programs to read it. The format is supported by all Adobe products, software from other manufacturers, and suppliers of editing systems.

Among other things, XMP defines:

the language of the document (one of the most important properties; especially important for visually impaired users, so that a screen reader reads the document aloud in the correct language)
the creation date
author/company name (origin of the document)
keywords

RDF (Resource Description Framework)

RDF is a technical approach used to describe Web resources (object, position, person) and their relationships to one another. RDF was originally conceived by the World Wide Web Consortium (W3C) as the standard for defining metadata. In the meantime, RDF has become the fundamental component of the Semantic Web. RDF resembles classic conceptual modeling approaches such as UML class diagrams and the entity-relationship model.

Standardization was aimed at grouping frequently used statements about an object into so-called ontologies that are identified by a namespace URI (Uniform Resource Identifier). This allows programs to display data logically to viewers.

Ontology

A group of concepts used to define metadata such as title, author, topic, description, date, identifier, language, camera type (for photos/pictures) and where a picture was taken
Common ontologies are Dublin Core, IPTC and Exif

ZUGFeRD

Uniform format for electronic invoices developed by the German Forum for Electronic Invoicing (FeRD)
Combination of the visual representation of a document and its raw data in a single PDF/A-3 file to avoid manual interventions in the automatic processing chain

Compart and XMP

The Compart Group is a highly experienced international provider of XMP solutions. This specialist in the optimization of data streams incorporates in its DocBridge product line the XMP model for PDF, PDF/A, PostScript, TIFF, JPG, PNG, XFF and XIF data formats. Compart solutions therefore ensure that XMP metadata accompanies PDF-to-TIFF conversions and is available for downstream processing. An XMP-enriched PDF document can also be converted to AFP without degrading the performance of AFP as the standard format used in high-speed printing.

For every area in document processing where XMP plays a role, Compart offers the appropriate support, including generating valid PDF/A files for long-term read-only archiving and automatic document comparison. The DocBridge line of solutions benefits from modular construction for seamless integration into existing document and output management structures. Companies that need powerful XMP support without replacing their entire document processing system profit from this modularity. You benefit even more from Compart's more than twenty years of solid expertise across the broad range of data streams.

Compart is an internationally active manufacturer of software for customer communication management. The company, with headquarters in Böblingen, has been present in the market for more than 30 years and has branches in Europe and North America. The scalable, platform-independent and easy-to-integrate solutions cover the entire cycle of document and…
