Organizations across industries want to categorize and extract insights from high volumes of documents of different formats. Manually processing these documents to classify and extract information remains expensive, error prone, and difficult to scale. Advances in generative artificial intelligence (AI) have given rise to intelligent document processing (IDP) solutions that can automate the document classification, and create a cost-effective classification layer capable of handling diverse, unstructured enterprise documents.
Categorizing documents is an important first step in IDP systems. It helps you determine the next set of actions to take depending on the type of document. For example, during the claims adjudication process, the accounts payable team receives the invoice, whereas the claims department manages the contract or policy documents. Traditional rule engines or ML-based classification can classify the documents, but often reach a limit on types of document formats and support for the dynamic addition of a new classes of document. For more information, see Amazon Comprehend document classifier adds layout support for higher accuracy.
In this post, we discuss document classification using the Amazon Titan Multimodal Embeddings model to classify any document types without the need for training.
Amazon recently introduced Titan Multimodal Embeddings in Amazon Bedrock. This model can create embeddings for images and text, enabling the creation of document embeddings to be used in new document classification workflows.
It generates optimized vector representations of documents scanned as images. By encoding both visual and textual components into unified numerical vectors that encapsulate semantic meaning, it enables rapid indexing, powerful contextual search, and accurate classification of documents.
As new document templates and types emerge in business workflows, you can simply invoke the Amazon Bedrock API to dynamically vectorize them and append to their IDP systems to rapidly enhance document classification capabilities.
Let’s examine the following document classification solution with the Amazon Titan Multimodal Embeddings model. For optimal performance, you should customize the solution to your specific use case and existing IDP pipeline setup.
This solution classifies documents using vector embedding semantic search by matching an input document to an already indexed gallery of documents. We use the following key components:
In the labeling process, a sample set of business documents like invoices, bank statements, or prescriptions are converted into embeddings using the Amazon Titan Multimodal Embeddings model and stored in a vector database against predefined labels. The Amazon Titan Multimodal Embedding model was trained using the Euclidean L2 algorithm and therefore for best results the vector database used should support this algorithm.
The following architecture diagram illustrates how you can use the Amazon Titan Multimodal Embeddings model with documents in an Amazon Simple Storage Service (Amazon S3) bucket for image gallery creation.
The workflow consists of the following steps:
When a new document needs classification, the same embedding model is used to convert the query document into an embedding. Then, a semantic similarity search is performed on the vector database using the query embedding. The label retrieved against the top embedding match will be the classification label for the query document.
The following architecture diagram illustrates how to use the Amazon Titan Multimodal Embeddings model with documents in an S3 bucket for image classification.
The workflow consists of the following steps:
To help you test the solution with your own documents, we have created an example Python Jupyter notebook, which is available on GitHub.
To run the notebook, you need an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to call Amazon Bedrock. Additionally, on the Model access page of the Amazon Bedrock console, make sure that access is granted for the Amazon Titan Multimodal Embeddings model.
In the following steps, replace each user input placeholder with your own information:
inputImageB64
is a base64 encoded byte array representing your document. The response from Amazon Bedrock contains the embeddings.In addition, the Euclidean L2 distance between the image on hand and the found image is also returned. If the image is an exact match, this value would be 0. The larger this value is, the further apart the images are in similarity.
In this section, we discuss additional considerations for using the solution effectively. This includes data privacy, security, integration with existing systems, and cost estimates.
The AWS shared responsibility model applies to data protection in Amazon Bedrock. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. Customers are responsible for maintaining control over their content that is hosted on this infrastructure. As a customer, you are responsible for the security configuration and management tasks for the AWS services that you use.
Amazon Bedrock avoids using customer prompts and continuations to train AWS models or share them with third parties. Amazon Bedrock doesn’t store or log customer data in its service logs. Model providers don’t have access to Amazon Bedrock logs or access to customer prompts and continuations. As a result, the images used for generating embeddings through the Amazon Titan Multimodal Embeddings model are not stored or employed in training AWS models or external distribution. Additionally, other usage data, such as timestamps and logged account IDs, is excluded from model training.
The Amazon Titan Multimodal Embeddings model underwent training with the Euclidean L2 algorithm, so the vector database being used should be compatible with this algorithm.
At the time of writing this post, as per Amazon Bedrock Pricing for the Amazon Titan Multimodal Embeddings model, the following are the estimated costs using on-demand pricing for this solution:
To avoid incurring future charges, delete the resources you created, such as the Amazon SageMaker notebook instance, when not in use.
In this post, we explored how you can use the Amazon Titan Multimodal Embeddings model to build an inexpensive solution for document classification in the IDP workflow. We demonstrated how to create an image gallery of known documents and perform similarity searches with new documents to classify them. We also discussed the benefits of using multimodal image embeddings for document classification, including their ability to handle diverse document types, scalability, and low latency.
As new document templates and types emerge in business workflows, developers can invoke the Amazon Bedrock API to vectorize them dynamically and append to their IDP systems to rapidly enhance document classification capabilities. This creates an inexpensive, infinitely scalable classification layer that can handle even the most diverse, unstructured enterprise documents.
Overall, this post provides a roadmap for building an inexpensive solution for document classification in the IDP workflow using Amazon Titan Multimodal Embeddings.
As next steps, check out What is Amazon Bedrock to start using the service. And follow Amazon Bedrock on the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Bedrock.
TL;DR A conversation with 4o about the potential demise of companies like Anthropic. As artificial…
Whether a company begins with a proof-of-concept or live deployment, they should start small, test…
Digital tools are not always superior. Here are some WIRED-tested agendas and notebooks to keep…
Machine learning (ML) models are built upon data.
Editor’s note: This is the second post in a series that explores a range of…
David J. Berg*, David Casler^, Romain Cledat*, Qian Huang*, Rui Lin*, Nissan Pow*, Nurcan Sonmez*,…