Timothy Allison


Tim is an internationally recognized expert on file processing, with more than 10 years of experience as a committer and 6 years as chair/VP on the Apache Tika project. Tim led the movement to gather web-scale corpora for parser regression testing in several open source communities. He also collaborated with the PDF Association to gather the cc-main-2021-31-pdf-untruncated/ corpus, a collection of 8 million PDFs from the internet. In addition to corpus development, he has implemented and contributed several tools within the Tika project that help evaluate content extraction without ground truth. These tools help clients reduce risk during tool migrations and upgrades and improve content extraction at scale.

Tim is also passionate about helping clients measure, tune and improve search relevancy across the enterprise, and about helping them develop and deploy advanced search techniques, including neural and hybrid search.

Tim founded Rhapsode Consulting LLC to serve a range of clients, from boutique software shops to major cloud providers and government agencies. Rhapsode Consulting works with a broad range of industries for which file processing and search are mission critical, including parser development, enterprise knowledge management, digital preservation, e-discovery and enterprise/site search.

He is a member of the Apache Software Foundation (ASF), the chair/VP of Apache Tika, and a committer on Apache Nutch (2023), Apache OpenNLP (2020), Apache Lucene/Solr (2018), Apache PDFBox (2016) and Apache POI (2013). Tim holds a Ph.D. in Classical Studies and in a former life was a professor of Latin and Greek.

Find Tim on LinkedIn.

Capabilities

Deep knowledge of PDF and other popular office formats, and of the parsers for those formats in the Java/Apache Software Foundation ecosystem. Helps clients develop evaluation methodologies and mitigation strategies for "challenging" documents and common document challenges. Helps clients dramatically improve the robustness and scalability of processing pipelines, and is available to build custom parsers or to add advanced capabilities to parsers in the Tika framework. Deep knowledge of Apache Solr, Elasticsearch and OpenSearch. Proficiency with PostgreSQL, H2 and other SQL databases. Works closely with cloud engineers to deploy and scale file processing and search. Expert-level Java, some Python. Has fun with Haystack, LangChain and retrieval-augmented generation (RAG) generally.

Availability

Tim is currently booked through the next few months, but he is always happy to scope projects and chat about file processing (especially PDFs!) as well as relevance tuning for search systems.

 


