New large-scale PDF corpus now publicly available

This new corpus – nearly 8 million PDFs totaling about 8 TB – was gathered from across the web in July/August of 2021.



With PDF the archive becomes the “Noah’s Ark” for every document

To survive a great flood, Noah brought pairs of every species aboard his ark. But is this approach – gathering the full diversity of document formats – genuinely suitable for archiving?

Pixartprinting is up in the ‘Clouds’ with License Server

Berlin. callas software, a leading provider of automated PDF quality assurance and archiving solutions, recently released License Server, a dynamic way of licensing callas products. With the arrival of License Server, Pixartprinting could finally migrate to a cloud-based infrastructure where …

Trump’s call with Zelensky: the PDF

Earlier this year we covered the release of the Mueller Report on the Special Counsel’s investigation into Russian interference in the 2016 US election. We highlighted how, in the digital age, the choices made in releasing that document made it …

Artifex Software is pleased to announce the release of MuPDF 1.16

NOVATO, CA, September 19, 2019 – Artifex Software, Inc., a leading provider of core technologies for document handling and management, is pleased to report the release of MuPDF 1.16. With this latest release, MuPDF now offers PDF support for the …

Show of industry support for labels and packaging standard

Cambridge, UK, 12th September 2019: Major players in the label and package printing industries are united behind the ISO standard for PDF Processing Steps which standardizes how technical marks are included in PDF files, to aid increased automation and reduce …

axaio software, Four Pees and callas software to host VIP Event focusing on pdfToolbox 11

Experts will demonstrate practical PDF applications for the fields of print, publishing and packaging.

ActivePDF Releases Toolkit Ultimate: The “All-in-One” Developer C# PDF Library for Complete Digital Transformation

ActivePDF’s award-winning C# PDF library for developers now includes API technology to rasterize images, reduce file sizes, redact sensitive data, search and extract data within PDF files, print PDF to paper, and much more.

Datalogics Offers PDF/A Parts 2 and 3 and ZUGFeRD Support

PDF/A has always been an important part of document management, and the Adobe PDF Library offers support for creating PDF/A documents that can adhere to Part 2 and Part 3 of the standard. This means you can create a Part …

Announcing PDF Declarations

Today the PDF Association publishes a new feature for PDF: a standardized mechanism allowing authors to define and share their own document profiles leveraging PDF technology. ISO-standardized subsets of PDF such as PDF/A, PDF/UA and PDF/X already include identification mechanisms. …

TWAIN Direct with PDF/Raster released

The TWAIN Working Group, a liaison member of the PDF Association, has just announced the release of TWAIN Direct, their next-generation open source image-acquisition technology. As the TWAIN Direct website points out, historically, application developers’ choice of image capture API …

