Part 1: Defending Your PDFs: Blocking AI Models from Scraping Your Data

Patrick Gallot has been working with software developers and PDFs since 2000. At Datalogics, he is the lead technical support engineer for the Adobe PDF Library and Datalogics PDF Java Toolkit, helping our customers resolve their complex questions and challenges. He is also active in the PDF community, presents at …


Understanding how to use TDM Reservation Protocol (TDMRep) for Ethical Text and Data Mining of PDFs
Generative AI models can be trained on content extracted from PDFs, a practice that falls under the broader heading of Text and Data Mining (TDM). This article outlines your options for allowing, limiting, or preventing this kind of data scraping.
Specifically, we’ll explain how to add machine-readable language to a PDF that serves one or more of the following use cases:
- Clarify to TDM users that rights are reserved for PDF content.
- Inform TDM users on the legal process for accessing and using the content of a PDF, including rights clearance and compensation where required.
- Opt out of machine searching of PDF content entirely.
Working with the TDM Reservation Protocol
The TDM Reservation Protocol (TDMRep) defines an “opt-out” mechanism for text and data mining (TDM). It addresses two key points:
- Whether specific content is permitted to be searched/mined for TDM purposes, and
- How rights holders can be contacted to obtain a license for such use, if required.
The most recent version of the specification is available here.
TDMRep’s rights reservation model has two properties:
- tdm-reservation (boolean) indicates if mining rights are reserved or not.
- tdm-policy (URL) gives access to publishers’ contact information and conditions for obtaining authorization to mine content.
Technically, the latter is optional. Leaving it out, however, gives TDM users no stated way to request a license, which may render the reservation ineffective in practice.
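To make the interplay of the two properties concrete, here is a minimal Python sketch of how a TDM tool might interpret them. The `TdmRights` class and `describe` function are hypothetical names invented for this illustration; they are not part of the TDMRep specification or of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TdmRights:
    """Hypothetical container for the two TDMRep properties."""
    reservation: bool                  # tdm-reservation
    policy_url: Optional[str] = None   # tdm-policy (optional)


def describe(rights: TdmRights) -> str:
    """Summarize what a TDM tool could conclude from the two properties."""
    if not rights.reservation:
        return "not reserved: TDMRep expresses no reservation of mining rights"
    if rights.policy_url:
        return f"reserved: consult {rights.policy_url} for licensing terms"
    return "reserved: no policy published, so there is no stated path to a license"


print(describe(TdmRights(True, "https://example.com/policy.json")))
print(describe(TdmRights(True)))
```

The third branch is the case described above: the reservation is still expressed, but the TDM user has nowhere to go to obtain authorization.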
These properties can be added to a PDF’s XMP metadata by an author or publisher, either proactively during creation or retroactively after publication, to indicate a reservation of rights and provide a pathway for obtaining a license for TDM access. See the W3C example at the source below:
Source: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/#example-8
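As a rough illustration of what such metadata involves, the following Python sketch builds an RDF fragment carrying the two properties and reads it back. This is not the W3C example. The namespace URI `http://www.w3.org/ns/tdmrep#`, the `tdm` prefix, and the element names `reservation` and `policy` reflect our reading of the spec and should be checked against it; the function names and the example.com URL are made up. Writing the fragment into an actual PDF requires a PDF library and is not shown here.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: namespace used for TDMRep properties in XMP; verify against the spec.
TDM_NS = "http://www.w3.org/ns/tdmrep#"


def build_tdm_rdf(reserved: bool, policy_url: Optional[str] = None) -> str:
    """Return an rdf:RDF fragment holding the reservation flag and optional policy URL."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("tdm", TDM_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""})
    ET.SubElement(desc, f"{{{TDM_NS}}}reservation").text = "1" if reserved else "0"
    if policy_url:
        ET.SubElement(desc, f"{{{TDM_NS}}}policy").text = policy_url
    return ET.tostring(rdf, encoding="unicode")


def read_tdm_rdf(xml_text: str) -> Tuple[bool, Optional[str]]:
    """Extract (reserved, policy_url) from an RDF fragment like the one built above."""
    root = ET.fromstring(xml_text)
    reservation = root.findtext(f".//{{{TDM_NS}}}reservation")
    policy = root.findtext(f".//{{{TDM_NS}}}policy")
    return reservation == "1", policy


fragment = build_tdm_rdf(True, "https://example.com/policies/policy.json")
print(fragment)
print(read_tdm_rdf(fragment))
```

In a real XMP packet this fragment would sit inside the usual `x:xmpmeta` wrapper alongside the document's other metadata.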
What does a TDM Policy look like?
A TDM policy is written in JSON-LD (JSON for Linked Data) and uses standard terminology from ODRL (Open Digital Rights Language), a W3C Recommendation for expressing permissions and obligations over digital content.
Here’s a list of required properties for the JSON structure, followed by example policies, courtesy of W3C:
- @context: an array of two values, specifically “http://www.w3.org/ns/odrl.jsonld” and “http://www.w3.org/ns/tdmrep.jsonld”
- uid: an identifier for the policy, expressed as a URI. It is not required that this link references a publicly accessible resource, but that is encouraged. W3C makes a simple recommendation to link to the website’s human-readable Terms of Use.
- @type: The value of this property must be “Offer”. Per the ODRL model, Offers are “proposals from Rights Holders for specific Rights over their Assets.”
- profile: The value of this property must be “http://www.w3.org/ns/tdmrep”.
- assigner: This property contains a set of child vCard properties for contacting the rights holder. By our best interpretation of the spec, it is not required that all of these vCard properties are used, but if any are used, they must be within the list provided by section 7.1.4 of the TDMRep spec.
- permission: an array of permissions. By our best interpretation of the spec, this array has only one required element, “action”: “tdm:mine”, which expresses that the permission defined in this policy is to “analyse, via automated analytical technique, text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”. The spec also lists a “duty” element, which expresses the TDM Actor’s duty to “obtain verifiable consent”. The spec doesn’t explicitly say that this element is required, but it seems important enough to warrant a strong recommendation at minimum.
Note that there are other JSON properties listed in the spec that fall under “recommended” and “optional”. The above list is for strictly required properties only.
Here is a simple example created by Datalogics:
An example by W3C can be found here: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/#example-19; this example contains a structure for defining access as it pertains to scientific research, which can vary greatly by locale.
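The required properties listed above can also be assembled and checked programmatically. The Python sketch below is one way to do that under our interpretation of the spec, not a definitive validator. The function names, the example.com values, the `vcard:fn` and `vcard:hasEmail` keys, and the `obtainConsent` spelling of the duty action are assumptions to verify against the TDMRep spec and the ODRL vocabulary.

```python
import json

ODRL_CONTEXT = "http://www.w3.org/ns/odrl.jsonld"
TDMREP_CONTEXT = "http://www.w3.org/ns/tdmrep.jsonld"
TDMREP_PROFILE = "http://www.w3.org/ns/tdmrep"


def build_policy(uid, assigner, require_consent=True):
    """Assemble a minimal TDM policy containing only the required properties."""
    permission = {"action": "tdm:mine"}
    if require_consent:
        # Assumption: "obtainConsent" is the ODRL action name for this duty.
        permission["duty"] = [{"action": "obtainConsent"}]
    return {
        "@context": [ODRL_CONTEXT, TDMREP_CONTEXT],
        "@type": "Offer",
        "profile": TDMREP_PROFILE,
        "uid": uid,
        "assigner": assigner,
        "permission": [permission],
    }


def find_problems(policy):
    """Return the names of required properties that are missing or invalid."""
    problems = []
    context = policy.get("@context")
    if (not isinstance(context, list) or len(context) != 2
            or ODRL_CONTEXT not in context or TDMREP_CONTEXT not in context):
        problems.append("@context")
    if policy.get("@type") != "Offer":
        problems.append("@type")
    if policy.get("profile") != TDMREP_PROFILE:
        problems.append("profile")
    if not policy.get("uid"):
        problems.append("uid")
    if not policy.get("assigner"):
        problems.append("assigner")
    permissions = policy.get("permission")
    if not isinstance(permissions, list) or not any(
            isinstance(p, dict) and p.get("action") == "tdm:mine" for p in permissions):
        problems.append("permission")
    return problems


# Hypothetical publisher details, for illustration only.
policy = build_policy(
    uid="https://example.com/terms",
    assigner={"vcard:fn": "Example Publisher",
              "vcard:hasEmail": "mailto:rights@example.com"},
)
print(json.dumps(policy, indent=2))
print(find_problems(policy))
```

The checker covers only the strictly required properties discussed in this article; the recommended and optional properties in the spec are out of scope for this sketch.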
To learn more about how to implement a TDMRep policy in your PDFs, please read part two.