PDF Research Portal

The PDF Research Portal connects students, researchers, academics, and industry-based researchers who can provide research capabilities relevant to PDF. Topics span software engineering, computer science, color science, data science, image processing, cybersecurity, machine learning, education, accessibility, and business domains.

The list of topics below represents current topics of interest to the PDF Association and the broader electronic document community.

If you are interested in participating in any of these, either by carrying out a research project or providing guidance or resources, please contact the PDF Association at pdf-research-support@pdfa.org.

These topics are not in any specific order, and many are flexible.

Optimization of Brotli compression for PDF using custom dictionaries

PDF will soon support Brotli compression as a general compression algorithm that improves on Flate. Standard Brotli (defined by RFC 7932) uses a predefined dictionary that was optimized for web technologies. This research will investigate possible benefits from defining an alternative dictionary that specifically targets PDF, either holistically or for specific kinds of compressed data streams (fonts, color profiles, content streams, etc.). Both Ghostscript (OSS) and MuPDF (OSS) have been modified to support Brotli compression, providing a convenient test platform.
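To make the idea of a domain-specific dictionary concrete, the sketch below uses the preset-dictionary feature of Python's standard `zlib` module (Flate) as a stand-in. It does not use Brotli, and it says nothing about how a Brotli dictionary for PDF would be defined. The dictionary contents and the sample content stream are assumptions chosen for illustration; a real study would derive a dictionary from a corpus of PDF streams.

```python
import zlib
from typing import Optional

# Hypothetical preset dictionary: byte sequences assumed to be common in
# PDF content streams. A real study would derive this from a corpus.
PDF_DICTIONARY = (
    b"0 0 0 rg 0 0 0 RG re f S "
    b"q 1 0 0 1 72 720 cm BT /F1 12 Tf () Tj ET Q"
)


def deflate(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict is None:
        compressor = zlib.compressobj(level=9)
    else:
        compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def inflate(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(data) + decompressor.flush()


# A short, made-up content stream that shares sequences with the dictionary.
sample = b"q 1 0 0 1 72 720 cm BT /F1 12 Tf (Hello PDF) Tj ET Q"
plain = deflate(sample)
primed = deflate(sample, PDF_DICTIONARY)
print(len(sample), len(plain), len(primed))
```

Short streams benefit most from a preset dictionary because they contain too little data for the compressor to learn repeated sequences on its own. Whether the same holds for Brotli across real PDF streams is the open question this topic poses.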

Domains:

Computing science
Image processing

Extensions to the formal Arlington PDF Data Model

The Arlington PDF Data Model is an open-source data model of all PDF objects in the PDF Document Object Model. It is defined as a set of text-based TSV files that use a predicate syntax to express data integrity requirements as they are expressed in ISO 32000. Extensions to the model might include extending the predicate grammar; formalizing the predicate grammar using ANTLR4 (some private work already exists in this area); creation of JSON or YAML equivalent data (e.g. for use with Metanorma publishing); etc. Improvements to the Python, Java, and C++ proof-of-concept implementations would also be very useful as many third parties look to integrate and adopt the Arlington PDF Data Model. See the GitHub Issues for a list.
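As a rough illustration of the "JSON equivalent data" idea, the sketch below converts a few TSV rows into JSON records using only the Python standard library. The rows, the three column names, and the assumption that multiple types are separated by `;` are simplifications made for this example; they are not copied from the Arlington TSV files, which carry more columns and predicates.

```python
import csv
import io
import json

# Illustrative rows only: the column names and values are assumptions made
# for this sketch, not a copy of any real Arlington TSV file.
SAMPLE_TSV = (
    "Key\tType\tRequired\n"
    "Type\tname\tTRUE\n"
    "Count\tinteger\tTRUE\n"
    "Kids\tarray\tTRUE\n"
    "Resources\tdictionary\tFALSE\n"
)


def tsv_to_records(tsv_text):
    """Turn TSV text into a list of JSON-friendly dictionaries."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    records = []
    for row in reader:
        records.append({
            "key": row["Key"],
            # Assumption: several permitted types are separated by ";".
            "types": row["Type"].split(";"),
            "required": row["Required"] == "TRUE",
        })
    return records


records = tsv_to_records(SAMPLE_TSV)
print(json.dumps(records, indent=2))
```

A real conversion would also need to decide how predicates inside cells are represented, for example as raw strings or as parsed expression trees.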

Domains:

Computing science
Document engineering
Formal methods
Data science

Improved visualizations of the Arlington PDF Data Model

The Arlington PDF Data Model is an open-source data model of all PDF objects in the PDF Document Object Model. Some very rudimentary 3D and VR visualizations have been developed, but these are slow and not particularly informative. Improved visualizations could help developers better understand the PDF DOM such as identifying cycles (loops); alternate paths to objects; etc.
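Before a cycle can be visualized it has to be found. The sketch below runs a depth-first search over a small, hypothetical fragment of an object graph and reports one loop. The `LINKS` table is an assumption written by hand for this example; a real tool would build the graph from the data model itself.

```python
# Hypothetical fragment of an object graph: each key is an object type and
# each value lists the object types that its entries can link to.
LINKS = {
    "Catalog": ["PageTreeNode"],
    "PageTreeNode": ["Page"],
    "Page": ["PageTreeNode", "Resources"],  # the Parent link closes a loop
    "Resources": [],
}


def find_cycle(graph, start):
    """Return one cycle reachable from start as a list of nodes, or None."""
    path = []        # nodes on the current depth-first path, in order
    on_path = set()  # same nodes, for fast membership tests
    done = set()     # nodes fully explored without finding a cycle

    def visit(node):
        path.append(node)
        on_path.add(node)
        for target in graph.get(node, []):
            if target in on_path:
                # Slice the path from the first visit of target to here.
                return path[path.index(target):] + [target]
            if target not in done:
                found = visit(target)
                if found:
                    return found
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


print(find_cycle(LINKS, "Catalog"))
```

The returned list starts and ends with the same node, which makes it straightforward to highlight the loop as a closed path in a diagram.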

Domains:

Computing science
Document engineering
UX/HCI
Data science
Data visualization

Formalization of PDF content stream syntax

There is no formalized grammar representation for the PDF graphic operators and operands that are in PDF content streams. An early attempt at an ANTLR4 grammar called "pdfcop" was developed by iText, but this is insufficient as it allows invalid syntax. Such a grammar would be a very useful extension to the open-source Arlington PDF Data Model.
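To give a sense of what "allowing invalid syntax" means, the sketch below checks only the number of numeric operands that precede a handful of operators. It is a deliberately small approximation and not a grammar: it assumes whitespace-separated tokens and ignores strings, names, arrays, dictionaries, inline images, and comments. The function name and the operator subset are choices made for this example.

```python
import re

# Operand counts for a handful of operators. A real grammar would cover the
# full operator set and the operand types, not just the counts.
OPERAND_COUNTS = {
    "q": 0, "Q": 0, "cm": 6, "m": 2, "l": 2,
    "re": 4, "rg": 3, "f": 0, "S": 0,
}

# Integers and reals such as 12, -3, 4., .5, +1.25
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def check_operands(stream):
    """Return a list of problems found in a whitespace-separated stream."""
    problems = []
    operands = []
    for token in stream.split():
        if NUMBER.fullmatch(token):
            operands.append(token)
        elif token in OPERAND_COUNTS:
            expected = OPERAND_COUNTS[token]
            if len(operands) != expected:
                problems.append(
                    f"{token}: expected {expected} operands, got {len(operands)}"
                )
            operands = []
        else:
            problems.append(f"unsupported token: {token}")
            operands = []
    if operands:
        problems.append("trailing operands without an operator")
    return problems


print(check_operands("q 1 0 0 1 72 720 cm 0 0 100 50 re f Q"))
print(check_operands("0 0 100 re"))
```

A formal grammar would go further, for example by constraining which operators may appear inside a path construction sequence or a text object.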

Domains:

Computing science
Document engineering
Formal methods
Data science

Formalization of PDF file structure and layout

There is no formalized grammar representation for the PDF file structure and layout, such as mathematical assertions or predicates related to cross-reference information, file offsets, etc. A formalism would be a very useful extension to the open source Arlington PDF Data Model and helpful to cyber-security researchers for understanding "shadow attacks", polyglots, etc.
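One example of a file-offset predicate is "the offset after the last `startxref` points at the cross-reference data". The sketch below tests a simplified version of that assertion on a small PDF-like byte string built in memory. It assumes a classic cross-reference table that begins with the `xref` keyword, so it would reject files that use cross-reference streams. The function names and the sample bytes are made up for this example, and the sample is not a complete, valid PDF.

```python
import re


def startxref_points_to_xref(data):
    """Check that the last startxref offset lands on the 'xref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not matches:
        return False
    offset = int(matches[-1])
    return data[offset:offset + 4] == b"xref"


def build_sample(offset_shift=0):
    """Build a tiny PDF-like byte string; offset_shift corrupts startxref."""
    body = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
    xref = (
        b"xref\n0 2\n"
        b"0000000000 65535 f \n"
        b"0000000009 00000 n \n"
    )
    # The cross-reference table starts right after the body.
    offset = str(len(body) + offset_shift).encode("ascii")
    trailer = b"trailer\n<< /Size 2 >>\nstartxref\n" + offset + b"\n%%EOF\n"
    return body + xref + trailer


print(startxref_points_to_xref(build_sample()))
print(startxref_points_to_xref(build_sample(offset_shift=1)))
```

A formal model would state many such predicates together, which is what makes inconsistencies between them, such as those exploited by polyglot files, visible.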

Domains:

Computing science
Document engineering
Formal methods
Cyber-security
Data science

Improvements to "pdf-cos-syntax" VSCode extension

The multi-platform Visual Studio Code (VSCode) extension for PDF ("pdf-cos-syntax") has proven to be very popular, especially as a learning tool for PDF, as a page description language similar to HTML. It is open-source (GitHub repo) and developed in TypeScript, but cannot use a proper PDF parser because files are only partially correct while being edited. This project would look to extend the capabilities, problem identification, etc.

Domains:

Computing science
Document engineering
Technology education

Educational videos and infographics

Few resources exist to help developers and stakeholders new to PDF understand the format, and ISO standards and specifications are not textbooks! For example, the PDF Association cheat sheets greatly condense content but assume an existing understanding. Most people looking to gain a basic understanding of PDF come with some limited understanding of web technologies (HTML, CSS, SVG, etc.) and, although many concepts are similar, there are critical differences. This project looks to develop introductory (i.e. simple, but still technically accurate) educational support materials for PDF in the form of videos and infographics, in a similar fashion to the student projects developed by the Ghent Workgroup for the graphic arts industry.

Domains:

Computing science
Document engineering
Technology education

Market segment research

PDF is a global and ubiquitous digital document format used across every industry, organization, and domain. Over the last 30 years, larger industry stakeholder groups have leveraged international standards to adapt the general PDF format to their needs, resulting in "PDF/A" for long-term preservation, archiving, and record keeping; "PDF/X" for the graphic arts and reproduction industries; "PDF/R" for scanning; and other PDF variants. However, within smaller domains and niche industries, PDF has also been adapted to suit more unique industry needs by individuals and organizations who lack a full understanding of PDF, who may not appreciate vendor biases, or where adoptions have not been maintained as PDF standards have evolved. Features such as interoperable metadata, rich content semantics, open data, etc. are often overlooked by these niche adoptions.

This research topic looks to identify niche applications of PDF, assess the current situation (such as vendor bias or being outdated), and identify opportunities by providing market research. Specific niche domains that might be researched include:

medical 3D PDF;
PDF used for government and regulatory submissions (e.g. FDA; SCC; in EU regulations);
PDF used in electronic financial workflows (beyond ZUGFeRD, Order-X, and Factur-X to Brazil Recibo Provisório de Serviços (RPS PDF), Chilean DTE PDF, Peru electronic proof of payment CPE, FatturaPA - Italian XML, Finvoice, etc.);
legal workflows in justice and court systems (including e-discovery).

Domains:

Business studies
Marketing
Legal studies
Biomedical

Regulated use of PDF in government legislation

Many government departments accept PDF documents with only a few rules defining what is acceptable or unacceptable. This market research topic is to investigate government legislation and regulations (e.g. in the USA, in the EU) as to what the relevant rules for PDF are. Are they up to date, outdated, or regularly maintained? Do they reference ISO international standards on PDF? Do they provide a level playing field for vendors (vendor neutral)? Are they harmonious (e.g. support archival and accessibility needs)? Are the rules testable? etc.

Domains:

Business studies
Legal studies
Political science
Education

Interactive comprehensive technical glossary

Like any complex technology that has evolved over time, PDF has a lot of technical terms and acronyms, whether they be formal, informal, or colloquial, resulting from developer slang or terminology used in products. The PDF Association has organically developed several resources which attempt to cover this: Glossary of PDF terms, Glossary of accessibility terminology in PDF, and PDF cheat sheets. This project looks to consolidate and improve the data behind these assets and publish as a well-structured interactive technical glossary resource that can support both new and experienced PDF stakeholders from around the globe. One resource being considered is this WordPress plugin.

Domains:

Computing science
Document engineering
Information design
UX/HCI
