Intelligent document processing (IDP) has seen widespread adoption across enterprise and government organizations. Gartner estimates that the IDP market will grow more than 100% year over year and projects it to reach $4.8 billion in 2022.
IDP helps transform structured, semi-structured, and unstructured data from a variety of document formats into actionable information. Processing unstructured data has become much easier with the advancements in optical character recognition (OCR), machine learning (ML), and natural language processing (NLP).
IDP techniques have grown tremendously, allowing us to extract, classify, identify, and process unstructured data. With AI/ML-powered services such as Amazon Textract, Amazon Transcribe, and Amazon Comprehend, building an IDP solution no longer requires specialized AI/ML skills.
In this post, we demonstrate how to use Amazon Textract to extract meaningful, actionable data from a wide range of complex multi-format PDF files. PDF files are challenging; they can have a variety of data elements like headers, footers, tables with data in multiple columns, images, graphs, and sentences and paragraphs in different formats. We explore the data extraction phase of IDP, and how it connects to the steps involved in a document process, such as ingestion, extraction, and postprocessing.
Amazon Textract provides various options for data extraction, based on your use case. You can use forms, tables, query-based extractions, handwriting recognition, invoices and receipts, identity documents, and more. All the extracted data is returned with bounding box coordinates. This solution uses Amazon Textract IDP CDK constructs to build the document processing workflow that handles Amazon Textract asynchronous invocation, raw response extraction, and persistence in Amazon Simple Storage Service (Amazon S3). This solution adds an Amazon Textract postprocessing component to the base workflow to handle paragraph-based text extraction.
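Because every extracted element comes back with bounding box coordinates, a postprocessing step can reason about where text sits on the page. The following is a minimal sketch, assuming a response in the Amazon Textract Blocks shape (BlockType, Text, Page, and Geometry.BoundingBox with Left, Top, Width, and Height as ratios of the page size). The function name extract_lines and the sample data are illustrative and are not part of the solution's code.

```python
from typing import Dict, List


def extract_lines(response: Dict) -> List[Dict]:
    """Return the text and bounding box of every LINE block in a Textract-style response."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        box = block["Geometry"]["BoundingBox"]
        lines.append(
            {
                "page": block.get("Page", 1),
                "text": block["Text"],
                "left": box["Left"],
                "top": box["Top"],
                "width": box["Width"],
                "height": box["Height"],
            }
        )
    return lines


# Hand-written sample in the response shape (not real Amazon Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {
            "BlockType": "LINE",
            "Page": 1,
            "Text": "Rising sea levels",
            "Geometry": {
                "BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.02}
            },
        },
    ]
}

sample_lines = extract_lines(sample_response)
```

The flattened dictionaries (page, text, left, top, width, height) are a convenience chosen for this sketch; the solution's postprocessor may represent lines differently.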
The following diagram shows the document processing flow.
The document processing flow contains the following steps:
1. A PDF document is uploaded to Amazon S3 under the uploads/ prefix, which triggers the AWS Step Functions asynchronous workflow.
2. The SimpleAsyncWorkflow Decider validates the PDF document. This step prevents processing invalid documents.
3. The workflow invokes the Amazon Textract Async API, handling Amazon Simple Notification Service (Amazon SNS) messages and workflow processing.
To deploy the solution, launch the AWS Cloud Development Kit (AWS CDK) using AWS Cloud9 or from your local system. If you’re launching from your local system, you need to have the AWS CDK and Docker installed. Follow the instructions in the GitHub repo for deployment.
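The validation and asynchronous invocation steps in the flow can be sketched as follows. This is an illustration under stated assumptions, not the logic of the IDP CDK constructs: looks_like_pdf is a hypothetical check based only on the "%PDF-" marker that PDF files start with, and build_async_request is a hypothetical helper that assembles the parameters accepted by the Amazon Textract StartDocumentTextDetection operation. The bucket, key, and ARN values are placeholders.

```python
from typing import Dict


def looks_like_pdf(first_bytes: bytes) -> bool:
    """Cheap sanity check: a PDF file begins with the '%PDF-' marker."""
    return first_bytes.startswith(b"%PDF-")


def build_async_request(
    bucket: str,
    key: str,
    sns_topic_arn: str,
    role_arn: str,
    output_prefix: str,
) -> Dict:
    """Assemble keyword arguments for an asynchronous text detection job."""
    return {
        # Document to analyze, read directly from Amazon S3.
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Amazon SNS topic that receives the job completion message.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
        # Where Amazon Textract persists the paginated raw results.
        "OutputConfig": {"S3Bucket": bucket, "S3Prefix": output_prefix},
    }


# Placeholder values for illustration only.
request = build_async_request(
    bucket="example-bucket",
    key="uploads/sample.pdf",
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:example-topic",
    role_arn="arn:aws:iam::111122223333:role/example-role",
    output_prefix="textract-temp-output",
)
```

With credentials configured, the request could be passed to boto3 as boto3.client("textract").start_document_text_detection(**request). In the deployed solution, the workflow performs this invocation for you.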
The stack creates the key components depicted in the architecture diagram.
The GitHub repo contains sample files, including sample_climate_change.pdf and sample_multi_column.pdf.
To test the solution, complete the following steps:
The Amazon Textract results are stored under TextractTempOutputJsonPath. The following screenshot of the Amazon S3 console shows the Amazon Textract paginated results (in this case, objects 1 and 2). The postprocessing task stores the extracted paragraphs from the sample PDF as extracted-text.csv.
Open the extracted-text.csv file to view the extracted content.
The sample_climate_change.pdf file has sentences flowing across pages, as shown in the following screenshot.
The postprocessor identifies and ignores the header and footer, and combines the text across pages into one paragraph. The extracted text for the combined paragraph should look like:
“Impacts on this scale could spill over national borders, exacerbating the damage further. Rising sea levels and other climate-driven changes could drive millions of people to migrate: more than a fifth of Bangladesh could be under water with a 1m rise in sea levels, which is a possibility by the end of the century. Climate-related shocks have sparked violent conflict in the past, and conflict is a serious risk in areas such as West Africa, the Nile Basin and Central Asia.”
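One way to approximate this behavior is to drop lines that fall in the top or bottom margin of a page and then join pages whose text does not end in sentence-final punctuation. The following sketch assumes lines are dictionaries with text, top, and height keys (bounding box values as ratios of page height). The margin thresholds and the punctuation rule are assumptions made for illustration; the solution's postprocessor may use different heuristics, and this sketch treats each page's body as a single run of text.

```python
from typing import Dict, List

HEADER_MAX_TOP = 0.08      # assumed: lines starting above this ratio are headers
FOOTER_MIN_BOTTOM = 0.92   # assumed: lines ending below this ratio are footers
SENTENCE_END = (".", "!", "?", ".”", '."')


def body_lines(lines: List[Dict]) -> List[Dict]:
    """Drop lines that sit in the assumed header or footer margins."""
    return [
        line
        for line in lines
        if line["top"] >= HEADER_MAX_TOP
        and line["top"] + line["height"] <= FOOTER_MIN_BOTTOM
    ]


def merge_pages(pages: List[List[Dict]]) -> List[str]:
    """Join page text, carrying unfinished sentences over to the next page."""
    paragraphs: List[str] = []
    carry = ""
    for page in pages:
        text = " ".join(line["text"] for line in body_lines(page))
        if not text:
            continue
        carry = f"{carry} {text}".strip()
        if carry.endswith(SENTENCE_END):
            paragraphs.append(carry)
            carry = ""
    if carry:
        paragraphs.append(carry)  # keep trailing text even if unfinished
    return paragraphs


# Hand-written sample: a sentence split across two pages.
page_1 = [
    {"text": "Report title", "top": 0.02, "height": 0.02},
    {"text": "Rising sea levels could drive millions", "top": 0.85, "height": 0.02},
]
page_2 = [
    {"text": "of people to migrate.", "top": 0.10, "height": 0.02},
    {"text": "Page 2", "top": 0.95, "height": 0.02},
]

merged = merge_pages([page_1, page_2])
```

Fixed margin ratios are the simplest choice; detecting text that repeats at the same position on every page would be a more robust way to find headers and footers.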
The sample_multi_column.pdf file has two columns of text with headers and footers, as shown in the following screenshot.
The postprocessor identifies and ignores the header and footer, processes the text in the columns from left to right, and combines incomplete sentences across pages. The extracted text should construct paragraphs from text in the left column and separate paragraphs from text in the right column. The last line in the right column is incomplete on that page and continues in the left column of the next page; the postprocessor should combine them as one paragraph.
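Reading a two-column page in the right order can be sketched by assigning each line to a column based on the horizontal center of its bounding box, then reading the left column top to bottom before the right column. This sketch assumes exactly two columns split at the middle of the page and lines represented as dictionaries with text, left, width, and top keys; both the split position and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def reading_order(lines: List[Dict], split: float = 0.5) -> List[Dict]:
    """Order lines left column first, then right column, each top to bottom."""

    def in_left_column(line: Dict) -> bool:
        center = line["left"] + line["width"] / 2
        return center < split

    left = [line for line in lines if in_left_column(line)]
    right = [line for line in lines if not in_left_column(line)]
    by_top = lambda line: line["top"]
    return sorted(left, key=by_top) + sorted(right, key=by_top)


# Hand-written sample in row-by-row order, as a naive top-to-bottom sort would give.
column_lines = [
    {"text": "L1", "left": 0.10, "width": 0.35, "top": 0.20},
    {"text": "R1", "left": 0.55, "width": 0.35, "top": 0.20},
    {"text": "L2", "left": 0.10, "width": 0.35, "top": 0.30},
    {"text": "R2", "left": 0.55, "width": 0.35, "top": 0.30},
]

ordered_text = [line["text"] for line in reading_order(column_lines)]
```

Applied page by page, the last line of the right column becomes the final text of that page, so a cross-page merge step can then join it with the first line of the left column on the next page.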
With Amazon Textract, you pay as you go based on the number of pages in the document. Refer to Amazon Textract pricing for actual costs.
When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. This helps you avoid continuing costs in your account.
You can use the solution presented in this post to build an efficient document extraction workflow and process the extracted document according to your needs. If you’re building an intelligent document processing system, you can further process the extracted document using Amazon Comprehend to get more insights about the document.
For more information about Amazon Textract, visit Amazon Textract resources to find video resources and blog posts, and refer to Amazon Textract FAQs. For more information about the IDP reference architecture, refer to Intelligent Document Processing. Please share your thoughts with us in the comments section, or in the issues section of the project’s GitHub repository.