Extract Content Structure from PDFs Using the AI-Powered Adobe PDF Extract API

Most companies need to extract specific content from unstructured documents into a business system of record to support their digital transformation. They struggle to do this at scale because a disproportionately large number of these documents are in PDF format. Extracting content successfully takes many tools and requires a deep understanding of the PDF file format. Quality is often poor, so post-processing is required to make the output usable.

Introducing Adobe PDF Extract API

We are excited to make available Adobe PDF Extract API (beta), an AI service that automatically understands content structure to extract text, tables and images from virtually any PDF document, whether digital or scanned.

PDF Extract API goes beyond OCR. It:

extracts text, tables and images
provides structural understanding of the content (headings, paragraphs, lists, reading order and many more…)
extracts table structure and cell data
extracts tables and images as .png

PDF Extract API uses Adobe Sensei to bring the power of artificial intelligence (AI) and machine learning to the process of extracting content from PDFs. You may have already seen it in action if you’ve ever viewed a PDF using Liquid Mode in Adobe Acrobat Reader on a mobile device. Liquid Mode can reflow many PDF documents so that they are easier to read on a smaller screen. It uses Adobe Sensei to deconstruct the page layout and then reorganizes the content to fit the screen. This Sensei technology is also at the core of the Extract service. It abstracts away the complexities of working with the PDF format and provides a richer output that can be consumed by any application.

PDF Extract API provides simple-to-use API actions that automatically extract content from PDF documents without the need for any custom code or ML experience. The following image shows an example page from a document and how each element on the page is expressed in the JSON output. PDF Extract API turns the PDF black box into something that is far more familiar to developers.
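To give a feel for what working with that JSON output might look like, here is a minimal Python sketch. The sample document and its field names ("elements", "Path", "Text", "Page") are assumptions made for illustration, not a confirmed schema, and summarize_elements is a hypothetical helper rather than part of any SDK.

```python
import json

# Hypothetical sample of the JSON output. The field names "elements",
# "Path", "Text" and "Page" are assumptions for illustration only.
SAMPLE_OUTPUT = """
{
  "elements": [
    {"Path": "//Document/H1", "Text": "Annual Report", "Page": 0},
    {"Path": "//Document/P", "Text": "Revenue grew this year.", "Page": 0},
    {"Path": "//Document/L/LI/LBody", "Text": "First item", "Page": 0}
  ]
}
"""


def summarize_elements(raw_json):
    """Return (element_type, page, text) tuples in document order."""
    data = json.loads(raw_json)
    summary = []
    for element in data.get("elements", []):
        path = element.get("Path", "")
        # The last path segment is taken as the element type, e.g. "H1" or "P".
        element_type = path.rsplit("/", 1)[-1]
        summary.append((element_type, element.get("Page"), element.get("Text", "")))
    return summary


for element_type, page, text in summarize_elements(SAMPLE_OUTPUT):
    print(f"page {page} [{element_type}] {text}")
```

Once each element carries a type and its text, ordinary JSON tooling is enough to walk the document. No PDF parsing is involved on the developer's side.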

Use Cases

Developers can use the PDF Extract API operations through the SDK for use cases such as the following.

Extract Specific Content for Process Automation

Many companies rely on contracts, financial reports, policy documents, invoices and many other types of documents in their business processes. You may want to extract only the section of content that is relevant to your business workflow and ingest it into a business system of record. A large portion of these documents are scanned and stored in PDF format.

Traditionally, this would be a highly manual process. Today, developers are leveraging recent advancements in NLP and RPA to augment humans and automate these otherwise manual workflows. Here you need to not only extract the text, but you also need to know how each text element relates to other content within the document. This contextual information is important and can be consumed by downstream applications to efficiently identify the section of content that is relevant to the workflow.

The PDF Extract API can extract text, tables and images and provide a structural understanding of this content for digital as well as scanned PDF documents. The structural elements provided include headings, paragraphs, lists, tables, figures and many more.
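As a rough illustration of how a downstream application might use that structure to pull out one relevant section, consider the Python sketch below. It assumes the extracted elements arrive as a list of dictionaries with "Path" and "Text" keys, and that heading elements end in H1 through H6. Both are assumptions for this example, and extract_section and heading_level are hypothetical helpers.

```python
import re

# Assumption: a heading element's path ends in H1..H6, optionally followed
# by an index such as "[2]".
_HEADING = re.compile(r"^H([1-6])(\[\d+\])?$")


def heading_level(path):
    """Return the heading level for a heading path, or None otherwise."""
    match = _HEADING.match(path.rsplit("/", 1)[-1])
    return int(match.group(1)) if match else None


def extract_section(elements, title):
    """Collect the text under the heading `title`, stopping at the next
    heading of the same or a higher level."""
    section = []
    current_level = None
    for element in elements:
        level = heading_level(element.get("Path", ""))
        text = element.get("Text", "")
        if current_level is None:
            # Still searching for the heading that opens the section.
            if level is not None and text.strip().lower() == title.lower():
                current_level = level
            continue
        if level is not None and level <= current_level:
            break  # the next sibling or parent heading ends the section
        if text:
            section.append(text)
    return section


# Hypothetical elements from a contract, already in reading order.
CONTRACT = [
    {"Path": "//Document/H1", "Text": "Master Agreement"},
    {"Path": "//Document/H2", "Text": "Payment Terms"},
    {"Path": "//Document/P", "Text": "Invoices are due in 30 days."},
    {"Path": "//Document/P[2]", "Text": "Late fees apply."},
    {"Path": "//Document/H2[2]", "Text": "Termination"},
    {"Path": "//Document/P[3]", "Text": "Either party may terminate."},
]

print(extract_section(CONTRACT, "payment terms"))
```

Matching on heading text is deliberately simple here. A production workflow would likely add fuzzy matching or an NLP step to locate the right section.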

Publish PDF Content

In other cases, you may want to repurpose the PDF content and present it in another form, for example, as responsive HTML or as audio through text-to-speech. In this case, you not only need to extract the text in proper reading order, but you also need to understand the document hierarchy, structure and styling information.

The PDF Extract API can detect the correct reading order for complex, multi-column content as well as for content across pages. In addition to this, it provides positional and styling information that can be easily used to represent the PDF content for any form factor.
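Because the elements are already in reading order, republishing can be as simple as mapping element types to tags. The Python sketch below shows one way this might look. It assumes each element is a dictionary with "Path" and "Text" keys, which is an assumption for illustration, and the TAGS mapping and to_html helper are hypothetical.

```python
from html import escape

# Hypothetical mapping from element type to HTML tag. Anything that is not
# listed falls back to a paragraph.
TAGS = {"H1": "h1", "H2": "h2", "H3": "h3", "P": "p"}


def to_html(elements):
    """Render elements, assumed to be in reading order, as simple HTML."""
    parts = []
    for element in elements:
        text = element.get("Text")
        if not text:
            continue  # skip elements without text, such as figures
        element_type = element.get("Path", "").rsplit("/", 1)[-1]
        tag = TAGS.get(element_type, "p")
        # Escape the text so characters like "<" and "&" stay valid HTML.
        parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(parts)


# Hypothetical elements for illustration.
PAGE = [
    {"Path": "//Document/H1", "Text": "Q&A"},
    {"Path": "//Document/P", "Text": "Prices < $10"},
    {"Path": "//Document/Figure"},
]

print(to_html(PAGE))
```

Positional and styling information could be layered on top of this, for example as CSS classes, but that is left out to keep the sketch short.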

Extract Table Data

Text arranged in a grid signals a table, and tables are particularly challenging for content extraction from PDF documents. For humans, scanning a row of text in a table is trivial. But for even the smartest AI, tables can be troublesome because the goal is not to present the content visually. Instead, the goal is to import the text as data into another system or into Excel for analysis. Today, much of this process is inefficient and requires manual rekeying of table data.

The PDF Extract API can detect bordered as well as un-bordered tables, understand table structure (header column/row, cells, etc.) and extract data from table cells.
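With row and cell structure available, rekeying can give way to a few lines of code. The Python sketch below rebuilds a single table as rows and writes it out as CSV. It assumes cell elements carry paths containing TR and TH or TD segments with an optional 1-based index (a missing index meaning the first row or column). That path convention, the sample data, and the table_rows and to_csv helpers are all assumptions made for illustration.

```python
import csv
import io
import re

# Assumption: a cell path contains ".../TR[row]/TD[col]/..." (or TH for
# header cells), and a missing index means 1.
_CELL = re.compile(r"/TR(?:\[(\d+)\])?/T[HD](?:\[(\d+)\])?")


def table_rows(elements):
    """Rebuild one table as a list of rows from its cell elements."""
    cells = {}
    for element in elements:
        match = _CELL.search(element.get("Path", ""))
        if not match or "Text" not in element:
            continue
        row = int(match.group(1) or 1)
        col = int(match.group(2) or 1)
        # Join several text elements that fall in the same cell.
        previous = cells.get((row, col), "")
        cells[(row, col)] = f"{previous} {element['Text']}".strip()
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def to_csv(rows):
    """Serialize rows as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical cell elements for a two-by-two table.
TABLE = [
    {"Path": "//Document/Table/TR/TH/P", "Text": "Item"},
    {"Path": "//Document/Table/TR/TH[2]/P", "Text": "Price"},
    {"Path": "//Document/Table/TR[2]/TD/P", "Text": "Widget"},
    {"Path": "//Document/Table/TR[2]/TD[2]/P", "Text": "9.99"},
]

print(to_csv(table_rows(TABLE)))
```

This sketch handles one table at a time. A document with several tables would need the elements grouped by table before calling table_rows.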

Final Thoughts

We are excited to offer the PDF Extract API service to our early adopter customers through a private beta program. If you’re interested in participating in the private beta, please request access here.

You can also now create compelling PDF experiences, including viewing and manipulating PDFs, with Adobe PDF Embed API and PDF Tools API. These APIs are part of an ecosystem aimed at helping third-party application developers extend Adobe products and technology in their custom applications and solutions.