In Part 1 of this series, we discussed intelligent document processing (IDP), and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract various types of documents in an insurance claims package, such as forms, tables, or specialized documents such as invoices, receipts, or ID documents. We looked into the challenges in legacy document processes, which are time-consuming, error-prone, expensive, and difficult to run at scale, and how you can use AWS AI services to help implement your IDP pipeline.
In this post, we walk you through advanced IDP features for document extraction, querying, and enrichment. We also look into how to use the structured information extracted from claims data to get insights using AWS Analytics and visualization services. We highlight how structured data extracted from IDP can help detect fraudulent claims using AWS Analytics services.
Intelligent document processing with AWS AI and Analytics services in the insurance industry
The following diagram illustrates the phases of IDP using AWS AI services. In Part 1, we discussed the first three phases of the IDP workflow. In this post, we expand on the extraction step and the remaining phases, which include integrating IDP with AWS Analytics services.
We use these analytics services for further insights and visualizations, and to detect fraudulent claims using structured, normalized data from IDP. The following diagram illustrates the solution architecture.
The phases we discuss in this post use the following key services:
Before you get started, refer to Part 1 for a high-level overview of the insurance use case with IDP and details about the data capture and classification stages.
For more information regarding the code samples, refer to our GitHub repo.
In Part 1, we saw how to use Amazon Textract APIs to extract information like forms and tables from documents, and how to analyze invoices and identity documents. In this post, we enhance the extraction phase with Amazon Comprehend to extract default and custom entities specific to custom use cases.
Insurance carriers often come across dense text in insurance claims applications, such as a patient’s discharge summary letter (see the following example image). It can be difficult to automatically extract information from such documents because they have no definite structure. To address this, we can use the following methods to extract key business information from the document:
We run the following code on the sample medical transcription document:
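As a rough illustration of this step, the sketch below calls the Amazon Comprehend DetectEntities API through boto3 and keeps the higher-confidence entities. The function names, the 0.5 score threshold, and the lazily created client are assumptions made for this example and are not taken from the GitHub repo.

```python
from typing import Any, Dict, List, Optional, Tuple


def make_comprehend_client(region_name: Optional[str] = None) -> Any:
    """Create a Comprehend client (needs AWS credentials and a Region)."""
    import boto3  # imported lazily so the helpers below work without AWS access

    return boto3.client("comprehend", region_name=region_name)


def detect_default_entities(
    comprehend: Any, text: str, language_code: str = "en"
) -> List[Dict[str, Any]]:
    """Call DetectEntities and return the raw list of default entities."""
    response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    return response["Entities"]


def summarize(
    entities: List[Dict[str, Any]], min_score: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return (Type, Text, Score) tuples for entities at or above min_score."""
    return [
        (e["Type"], e["Text"], round(e["Score"], 2))
        for e in entities
        if e["Score"] >= min_score
    ]
```

In practice, the text passed in would be the raw text that Amazon Textract extracted from the discharge summary, for example `summarize(detect_default_entities(make_comprehend_client(), document_text))`.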
The following screenshot shows a collection of entities identified in the input text. The output has been shortened for the purposes of this post. Refer to the GitHub repo for a detailed list of entities.
The response from the DetectEntities API includes the default entities. However, we’re interested in knowing specific entity values, such as the patient’s name (denoted by the default entity PERSON), or the patient’s ID (denoted by the default entity OTHER). To recognize these custom entities, we train an Amazon Comprehend custom entity recognizer model. We recommend following the comprehensive steps on how to train and deploy a custom entity recognition model in the GitHub repo.
After we deploy the custom model, we can use the helper function get_entities() to retrieve custom entities like PATIENT_NAME and PATIENT_D from the API response:
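The repo's helper is not reproduced here, so the following is only a hypothetical version of what a get_entities() function might look like. It assumes the custom recognizer is deployed behind a real-time endpoint and that the endpoint ARN is passed to DetectEntities through the EndpointArn parameter; the signature and the return shape are assumptions for this sketch.

```python
from typing import Any, Dict, Iterable, List


def get_entities(
    comprehend: Any,
    text: str,
    endpoint_arn: str,
    wanted_types: Iterable[str],
) -> Dict[str, List[str]]:
    """Group the text of custom entities by type, keeping only wanted types.

    Hypothetical helper: `comprehend` is a boto3 Comprehend client and
    `endpoint_arn` points at the deployed custom entity recognizer.
    """
    # With EndpointArn set, Comprehend uses the custom model's language.
    response = comprehend.detect_entities(Text=text, EndpointArn=endpoint_arn)
    wanted = set(wanted_types)
    found: Dict[str, List[str]] = {}
    for entity in response["Entities"]:
        if entity["Type"] in wanted:
            found.setdefault(entity["Type"], []).append(entity["Text"])
    return found
```

Returning a dictionary keyed by entity type keeps the downstream lookup simple, for example `get_entities(client, text, arn, ["PATIENT_NAME"]).get("PATIENT_NAME", [])`.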
The following screenshot shows our results.
In the document enrichment phase, we perform enrichment functions on healthcare-related documents to draw valuable insights. We look at the following types of enrichment:
Documents such as medical providers’ notes and clinical trial reports include dense medical text. Insurance claims carriers need to identify the relationships among the extracted health information from this dense text and link them to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT codes. This is very valuable in automating claim capture, validation, and approval workflows for insurance companies to accelerate and simplify claim processing. Let’s look at how we can use the Amazon Comprehend Medical InferICD10CM API to detect possible medical conditions as entities and link them to their codes:
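One way this call could look with boto3 is sketched below. It assumes a Comprehend Medical client (boto3.client("comprehendmedical")) is passed in, and it keeps only the highest-scoring ICD-10-CM concept per entity; that selection rule and the function name are choices made for this example.

```python
from typing import Any, Dict, List


def link_icd10_codes(
    comprehend_medical: Any, text: str, min_score: float = 0.0
) -> List[Dict[str, Any]]:
    """Call InferICD10CM and return the best-scoring code for each entity."""
    response = comprehend_medical.infer_icd10_cm(Text=text)
    results: List[Dict[str, Any]] = []
    for entity in response["Entities"]:
        concepts = entity.get("ICD10CMConcepts", [])
        if not concepts:
            continue  # entity detected, but no candidate code was linked
        best = max(concepts, key=lambda concept: concept["Score"])
        if best["Score"] >= min_score:
            results.append(
                {
                    "text": entity["Text"],
                    "code": best["Code"],
                    "description": best["Description"],
                    "score": best["Score"],
                }
            )
    return results
```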
For the input text, which we can pass in from the Amazon Textract DetectDocumentText API, the InferICD10CM API returns the following output (shortened for brevity).
Similarly, we can use the Amazon Comprehend Medical InferRxNorm API to identify medications and the InferSNOMEDCT API to detect medical entities within healthcare-related insurance documents.
Insurance claims packages are subject to extensive privacy compliance requirements and regulations because they contain both PII and PHI data. Insurance carriers can reduce compliance risk by redacting information like policy numbers or the patient’s name.
Let’s look at an example of a patient’s discharge summary. We use the Amazon Comprehend DetectPiiEntities API to detect PII entities within the document and protect the patient’s privacy by redacting these entities:
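The sketch below shows the detection call and a simple text-level redaction that masks each detected span by its character offsets. This is a simpler technique than redacting the document image by bounding box, and it assumes the detected entities do not overlap; the function names and the 0.5 threshold are illustrative.

```python
from typing import Any, Dict, List


def detect_pii(
    comprehend: Any, text: str, language_code: str = "en"
) -> List[Dict[str, Any]]:
    """Call DetectPiiEntities and return entities with Type, Score, and offsets."""
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    return response["Entities"]


def redact_text(
    text: str, entities: List[Dict[str, Any]], min_score: float = 0.5
) -> str:
    """Replace each confident PII span with a [TYPE] placeholder."""
    redacted = text
    # Work from the end of the text so earlier offsets remain valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] >= min_score:
            redacted = (
                redacted[: entity["BeginOffset"]]
                + "[" + entity["Type"] + "]"
                + redacted[entity["EndOffset"]:]
            )
    return redacted
```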
We get the following PII entities in the response from the detect_pii_entities() API:
We can then redact the detected PII entities from the documents by using the bounding box geometry of the entities. For that, we use a helper tool called amazon-textract-overlayer. For more information, refer to Textract-Overlayer. The following screenshots compare a document before and after redaction.
Similar to the Amazon Comprehend DetectPiiEntities API, we can also use the DetectPHI API to detect PHI data in the clinical text being examined. For more information, refer to Detect PHI.
In the document review and validation phase, we can verify whether the claim package meets the business’s requirements, because we have collected all the information from the documents in the package in earlier stages. We can do this by introducing a human in the loop who reviews and validates all the fields, or by using an auto-approval process for low-dollar claims, before sending the package to downstream applications. We can use Amazon Augmented AI (Amazon A2I) to automate the human review process for insurance claims processing.
Now that we have all required data extracted and normalized from claims processing using AI services for IDP, we can extend the solution to integrate with AWS Analytics services such as AWS Glue and Amazon Redshift to solve additional use cases and provide further analytics and visualizations.
In this post, we implement a serverless architecture where the extracted and processed data is stored in a data lake and is used to detect fraudulent insurance claims using ML. We use Amazon Simple Storage Service (Amazon S3) to store the processed data. We can then use AWS Glue or Amazon EMR to cleanse the data and add additional fields to make it consumable for reporting and ML. After that, we use Amazon Redshift ML to build a fraud detection ML model. Finally, we build reports using Amazon QuickSight to get insights into the data.
For the purpose of this example, we have created a sample dataset that emulates the output of an ETL (extract, transform, and load) process, and use AWS Glue Data Catalog as the metadata catalog. First, we create a database named idp_demo in the Data Catalog and an external schema in Amazon Redshift called idp_insurance_demo (see the following code). We use an AWS Identity and Access Management (IAM) role to grant permissions to the Amazon Redshift cluster to access Amazon S3 and Amazon SageMaker. For more information about how to set up this IAM role with least privilege, refer to Cluster and configure setup for Amazon Redshift ML administration.
The next step is to create an external table in Amazon Redshift referencing the S3 location where the file is located. In this case, our file is a comma-separated text file. We also want to skip the header row from the file, which can be configured in the table properties section. See the following code:
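To make the two steps above concrete, here is a sketch that builds the Amazon Redshift statements as Python strings. The schema and database names come from the text; the IAM role ARN, the S3 location, and especially the column list are placeholders you would supply, because the actual dataset columns are not listed in this post.

```python
from typing import Sequence, Tuple


def external_schema_sql(
    iam_role_arn: str,
    schema: str = "idp_insurance_demo",
    database: str = "idp_demo",
) -> str:
    """Build CREATE EXTERNAL SCHEMA for a Data Catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(
    s3_location: str,
    columns: Sequence[Tuple[str, str]],  # (name, SQL type); hypothetical columns
    schema: str = "idp_insurance_demo",
    table: str = "claims",
) -> str:
    """Build CREATE EXTERNAL TABLE for a CSV file with one header row."""
    column_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"    {column_defs}\n"
        f")\n"
        f"ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY ','\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        f"TABLE PROPERTIES ('skip.header.line.count'='1');"
    )
```

The `skip.header.line.count` table property is what tells Amazon Redshift to ignore the header row of the comma-separated file.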
After we create the external table, we prepare our dataset for ML by splitting it into a training set and a test set. We create a new external table called claims_train, which consists of all records with ID <= 85000 from the claims table. This is the training set that we train our ML model on.
We create another external table called claims_test that consists of all records with ID > 85000 to be the test set that we test the ML model on:
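The split itself is a pair of filtered copies of the claims table. The sketch below demonstrates the idea locally with Python's built-in sqlite3 module as a stand-in for Amazon Redshift, so the table types differ from the external tables described above; the table names and the `id` column are assumed from the text.

```python
import sqlite3
from typing import Tuple


def split_claims(conn: sqlite3.Connection, threshold: int = 85000) -> Tuple[int, int]:
    """Copy claims into claims_train (id <= threshold) and claims_test (id > threshold).

    Returns the row counts of the two new tables.
    """
    cutoff = int(threshold)  # cast so only an integer reaches the SQL text
    conn.execute(
        f"CREATE TABLE claims_train AS SELECT * FROM claims WHERE id <= {cutoff}"
    )
    conn.execute(
        f"CREATE TABLE claims_test AS SELECT * FROM claims WHERE id > {cutoff}"
    )
    train_rows = conn.execute("SELECT COUNT(*) FROM claims_train").fetchone()[0]
    test_rows = conn.execute("SELECT COUNT(*) FROM claims_test").fetchone()[0]
    return train_rows, test_rows
```

Splitting on the ID is simple and repeatable, but it only gives a representative test set if fraud is not correlated with the order in which records were assigned IDs.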
Now we create the model using the CREATE MODEL command (see the following code). We select the relevant columns from the claims_train table that can determine a fraudulent transaction. The goal of this model is to predict the value of the fraud column; therefore, fraud is added as the prediction target. After the model is trained, it creates a function named insurance_fraud_model. This function is used for inference while running SQL statements to predict the value of the fraud column for new records.
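A sketch of how such a CREATE MODEL statement could be assembled is shown below. The target column, function name, and training table follow the text; the model name, the feature columns, the IAM role ARN, and the S3 bucket are placeholders assumed for this example.

```python
from typing import Sequence


def create_model_sql(
    feature_columns: Sequence[str],  # hypothetical feature column names
    iam_role_arn: str,
    s3_bucket: str,
    model_name: str = "idp_insurance_demo.fraud_detection_model",  # assumed name
    source_table: str = "idp_insurance_demo.claims_train",
    target: str = "fraud",
    function_name: str = "insurance_fraud_model",
) -> str:
    """Build a Redshift ML CREATE MODEL statement with fraud as the target."""
    select_list = ", ".join(list(feature_columns) + [target])
    return (
        f"CREATE MODEL {model_name}\n"
        f"FROM (SELECT {select_list} FROM {source_table})\n"
        f"TARGET {target}\n"
        f"FUNCTION {function_name}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )
```

The target column has to appear in the SELECT list along with the features, which is why the helper appends it automatically.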
After we create the model, we can run queries to check the accuracy of the model. We use the insurance_fraud_model function to predict the value of the fraud column for new records. Run the following query on the claims_test table to create a confusion matrix:
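A confusion matrix query groups the test rows by actual label and predicted label and counts each combination. To keep the sketch runnable without a cluster, it uses sqlite3 and registers a stub SQL function named insurance_fraud_model; the stub's threshold rule and the `claim_amount` column are invented for illustration and stand in for the trained Redshift ML function and its real inputs.

```python
import sqlite3
from typing import List, Tuple


def register_stub_model(conn: sqlite3.Connection) -> None:
    """Register a fake insurance_fraud_model(amount) for local testing only."""
    conn.create_function(
        "insurance_fraud_model", 1, lambda amount: 1 if amount > 10000 else 0
    )


def confusion_matrix(
    conn: sqlite3.Connection, table: str = "claims_test"
) -> List[Tuple[int, int, int]]:
    """Return (actual, predicted, count) rows for the given test table."""
    query = f"""
        SELECT fraud AS actual,
               insurance_fraud_model(claim_amount) AS predicted,
               COUNT(*) AS num_claims
        FROM {table}
        GROUP BY actual, predicted
        ORDER BY actual, predicted
    """
    return conn.execute(query).fetchall()
```

Rows where actual and predicted match are correct predictions; a row with actual 1 and predicted 0 counts the fraudulent claims the model missed.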
After we create the new model, as new claims data is inserted into the data warehouse or data lake, we can use the insurance_fraud_model function to flag fraudulent transactions. We do this by first loading the new data into a temporary table. Then we use the insurance_fraud_model function to calculate the fraud flag for each new transaction and insert the data along with the flag into the final table, which in this case is the claims table.
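The load-then-score pattern described above can be sketched as an INSERT ... SELECT from a staging table. As before, this runs on sqlite3 with a stub insurance_fraud_model function so that it works offline; the staging table name, the `claim_amount` column, and the stub's rule are assumptions for this example only.

```python
import sqlite3


def register_stub_model(conn: sqlite3.Connection) -> None:
    """Register a fake insurance_fraud_model(amount) for local testing only."""
    conn.create_function(
        "insurance_fraud_model", 1, lambda amount: 1 if amount > 10000 else 0
    )


def score_new_claims(
    conn: sqlite3.Connection,
    staging_table: str = "new_claims_staging",  # hypothetical temporary table
    final_table: str = "claims",
) -> None:
    """Score staged claims and append them, with the fraud flag, to the final table."""
    conn.execute(
        f"INSERT INTO {final_table} (id, claim_amount, fraud) "
        f"SELECT id, claim_amount, insurance_fraud_model(claim_amount) "
        f"FROM {staging_table}"
    )
    conn.execute(f"DELETE FROM {staging_table}")  # clear staging after the load
    conn.commit()
```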
When the data is available in Amazon Redshift, we can create visualizations using QuickSight. We can then share the QuickSight dashboards with business users and analysts. To create the QuickSight dashboard, you first need to create an Amazon Redshift dataset in QuickSight. For instructions, refer to Creating a dataset from a database.
After you create the dataset, you can create a new analysis in QuickSight using the dataset. The following are some sample reports we created:
A chart based on the fraud field that shows the proportion of fraudulent transactions compared to the total number of transactions in a particular state.
A chart based on the fraud field that shows the proportion of the dollar amount of fraudulent transactions compared to the total dollar amount of transactions in a particular state.
A chart based on the fraud field that shows how many claims were filed for each insurance company and how many of them are fraudulent.
To prevent incurring future charges to your AWS account, delete the resources that you provisioned in the setup by following the instructions in the Cleanup section in our repo.
In this two-part series, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We explored a claims processing use case in the insurance industry and how IDP can help automate this use case using services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I. In Part 1, we demonstrated how to use AWS AI services for document extraction. In Part 2, we extended the extraction phase and performed data enrichment. Finally, we extended the structured data extracted from IDP for further analytics, and created visualizations to detect fraudulent claims using AWS Analytics services.
We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.