In this post, we discuss how United Airlines, in collaboration with the Amazon Machine Learning Solutions Lab, build an active learning framework on AWS to automate the processing of passenger documents.
“In order to deliver the best flying experience for our passengers and make our internal business process as efficient as possible, we have developed an automated machine learning-based document processing pipeline in AWS. In order to power these applications, as well as those using other data modalities like computer vision, we need a robust and efficient workflow to quickly annotate data, train and evaluate models, and iterate quickly. Over the course a couple months, United partnered with the Amazon Machine Learning Solutions Labs to design and develop a reusable, use case-agnostic active learning workflow using AWS CDK. This workflow will be foundational to our unstructured data-based machine learning applications as it will enable us to minimize human labeling effort, deliver strong model performance quickly, and adapt to data drift.”
– Jon Nelson, Senior Manager of Data Science and Machine Learning at United Airlines.
United’s Digital Technology team is made up of globally diverse individuals working together with cutting-edge technology to drive business outcomes and keep customer satisfaction levels high. They wanted to take advantage of machine learning (ML) techniques such as computer vision (CV) and natural language processing (NLP) to automate document processing pipelines. As part of this strategy, they developed an in-house passport analysis model to verify passenger IDs. The process relies on manual annotations to train ML models, which are very costly.
United wanted to create a flexible, resilient, and cost-efficient ML framework for automating passport information verification, validating passenger’s identities and detecting possible fraudulent documents. They engaged the ML Solutions Lab to help achieve this goal, which allows United to continue delivering world-class service in the face of future passenger growth.
Our joint team designed and developed an active learning framework powered by the AWS Cloud Development Kit (AWS CDK), which programmatically configures and provisions all necessary AWS services. The framework uses Amazon SageMaker to process unlabeled data, creates soft labels, launches manual labeling jobs with Amazon SageMaker Ground Truth, and trains an arbitrary ML model with the resulting dataset. We used Amazon Textract to automate information extraction from specific document fields such as name and passport number. On a high level, the approach can be described with the following diagram.
The primary dataset for this problem is comprised of tens of thousands of main-page passport images from which personal information (name, date of birth, passport number, and so on) must be extracted. Image size, layout, and structure vary depending on the document issuing country. We normalize these images into a set of uniform thumbnails, which constitute the functional input for the active learning pipeline (auto-labeling and inference).
The second dataset contains JSON line formatted manifest files that relate raw passport images, thumbnail images, and label information such as soft labels and bounding box positions. Manifest files serve as a metadata set storing results from various AWS services in a unified format, and decouple the active learning pipeline from downstream services used by United. The following diagram illustrates this architecture.
The following code is an example manifest file:
The solution includes two main components:
The ML framework is responsible for training the ML model and deploying it as a SageMaker endpoint. The auto-labeling pipeline focuses on automating SageMaker Ground Truth jobs and sampling images for labeling through those jobs.
The two components are decoupled from each other and only interact through the set of labeled images produced by the auto-labeling pipeline. That is, the labeling pipeline creates labels that are later used by the ML framework to train the ML model.
The ML Solutions Lab team built the ML framework using the Hugging Face implementation of the state-of-art LayoutLMV2 model (LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding, Yang Xu, et al.). Training was based on Amazon Textract outputs, which served as a preprocessor and produced bounding boxes around text of interest. The framework uses distributed training and runs on a custom Docker container based on the SageMaker pre-built Hugging Face image with additional dependencies (dependencies that are missing in the pre-built SageMaker Docker image but required for Hugging Face LayoutLMv2).
The ML model was trained to classify document fields in the following 11 classes:
The training pipeline can be summarized in the following diagram.
First, we resize and normalize a batch of raw images into thumbnails. At the same time, a JSON line manifest file with one line per image is created with information about raw and thumbnail images from the batch. Next, we use Amazon Textract to extract text bounding boxes in the thumbnail images. All information produced by Amazon Textract is recorded in the same manifest file. Finally, we use the thumbnail images and manifest data to train a model, which is later deployed as a SageMaker endpoint.
We developed an auto-labeling pipeline designed to perform the following functions:
The uncertainty sampling strategy reduces the number of images sent to the human labeling job by selecting images that would likely contribute the most to improving model accuracy. Because human labeling is an expensive task, such sampling is an important cost reduction technique. We support four sampling strategies, which can be selected as a parameter stored in Parameter Store, a capability of AWS Systems Manager:
The entire auto-labeling workflow was implemented with AWS Step Functions, which orchestrates the processing job (called the elastic endpoint for batch inference), uncertainty sampling, and SageMaker Ground Truth. The following diagram illustrates the Step Functions workflow.
The main factor influencing labeling costs is manual annotation. Before deploying this solution, the United team had to use a rule-based approach, which required expensive manual data annotation and third-party parsing OCR techniques. With our solution, United reduced their manual labeling workload by manually labeling only images that would result in the largest model improvements. Because the framework is model-agnostic, it can be used in other similar scenarios, extending its value beyond passport images to a much broader set of documents.
We performed a cost analysis based on the following assumptions:
Our analysis shows that training cost remains constant and high without active learning. Incorporating active learning results in exponentially decreasing costs with each new batch of data.
We further reduced costs by deploying the inference endpoint as an elastic endpoint by adding an auto scaling policy. The endpoint resources can scale up or down between zero and a configured maximum number of instances.
Our focus was to help the United team meet their functional requirements while building a scalable and flexible cloud application. The ML Solutions Lab team developed the complete production-ready solution with help of AWS CDK, automating management and provisioning of all cloud resources and services. The final cloud application was deployed as a single AWS CloudFormation stack with four nested stacks, each represented a single functional component.
Almost every pipeline feature, including Docker images, endpoint auto scaling policy, and more, was parameterized through Parameter Store. With such flexibility, the same pipeline instance could be run with a broad range of settings, adding the ability to experiment.
In this post, we discussed how United Airlines, in collaboration with the ML Solutions Lab, built an active learning framework on AWS to automate the processing of passenger documents. The solution had great impact on two important aspects of United’s automation goals:
If you are interested in implementing a similar solution or want to learn more about the ML Solutions Lab, contact your account manager or visit us at Amazon Machine Learning Solutions Lab.
Jasper Research Lab’s new shadow generation research and model enable brands to create more photorealistic…
We’re announcing new updates to Gemini 2.0 Flash, plus introducing Gemini 2.0 Flash-Lite and Gemini…
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response…
This post is co-written with Martin Holste from Trellix. Security teams are dealing with an…
As AI continues to unlock new opportunities for business growth and societal benefits, we’re working…
An internal email obtained by WIRED shows that NOAA workers received orders to pause “ALL…