Huntington Bank: Redacting sensitive data from 400M+ documents with AWS

When your document repository contains hundreds of millions of files accumulated over nearly a decade, how do you systematically find and redact sensitive customer data without taking years to complete? This was the challenge facing The Huntington National Bank (Huntington), a top 10 bank in the United States.

Redacting sensitive information at scale

Since 2015, Huntington’s document management system has securely stored hundreds of millions of documents on-premises. In 2025, as part of a proactive compliance initiative, Huntington set out to process the documents in this system and redact sensitive data. These documents come in different formats, so the solution needed flexibility to handle varied file types while delivering the throughput required to process millions of documents quickly.

Original estimates indicated this effort would take years. However, by designing a scalable redaction workflow using Amazon Textract, Amazon SageMaker, AWS Step Functions, and AWS Lambda, Huntington reduced this timeline to months.

Solution overview

Before examining the technical implementation, let’s look at the core requirements Huntington established for this project. If you’re facing a similar large-scale document processing challenge, these requirements can serve as a starting point for your own solution design:

  • Data must be encrypted at rest and in transit.
  • Locations where data is stored or accessed must meet strict access requirements.
  • Services used must be in-scope for PCI DSS compliance.
  • Outputs must be replicated back to on-premises data stores.
  • Redaction accuracy must meet or exceed 95% to meet compliance requirements.

The following diagram illustrates the high-level solution architecture.

Moving data securely, with confidence

Huntington’s first objective was to move documents from an on-premises file share to an Amazon Simple Storage Service (Amazon S3) bucket. Moving documents is straightforward, but this effort required transferring over 400 million documents, encrypted in transit and at rest. To accomplish this, Huntington used AWS DataSync, AWS Direct Connect, Amazon S3, and AWS Key Management Service (AWS KMS).

AWS DataSync can be deployed as an agent in your on-premises data center to monitor a configured source, such as an SMB file share. While getting documents to AWS was critical for processing, AWS DataSync also supports syncing data back to on-premises, which was another key requirement for this project.
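As an illustration of the encryption requirements, the following minimal sketch builds two configuration documents in Python: a bucket policy that denies any request not sent over TLS, and a default-encryption rule that applies AWS KMS to new objects. The bucket name and key ARN are hypothetical placeholders, and this is not Huntington's actual configuration. The resulting dictionaries have the shapes accepted by the Amazon S3 bucket policy and bucket encryption APIs.

```python
import json


def build_tls_only_policy(bucket_name: str) -> dict:
    """Bucket policy that denies requests made without TLS (encryption in transit)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" when the request is not HTTPS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def build_default_encryption(kms_key_arn: str) -> dict:
    """Default-encryption rule so new objects are encrypted at rest with AWS KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                "BucketKeyEnabled": True,
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(json.dumps(build_tls_only_policy("example-redaction-bucket"), indent=2))
    print(json.dumps(
        build_default_encryption("arn:aws:kms:us-east-1:111122223333:key/example"),
        indent=2,
    ))
```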

Detecting sensitive data using Amazon Textract

Amazon Textract is an AWS machine learning service that extracts text, tables, and forms from scanned documents. Financial institutions use it to automatically process documents like account statements or loan applications, then identify sensitive data such as Social Security numbers, account numbers, and personal addresses. The following sample invoice demonstrates this capability.

Amazon Textract detects various fields from a document and provides coordinates of detected fields and other metadata within a JSON output.
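To make the JSON output concrete, here is a minimal sketch that reads LINE blocks from a Textract-style response and converts their bounding boxes to pixel coordinates. Textract reports bounding boxes as ratios of the page width and height, so the page size in pixels must be supplied by the caller. The sample response is hand-written for illustration, and the `DetectedLine` type and `extract_lines` helper are hypothetical names rather than part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedLine:
    text: str
    confidence: float
    page: int
    box: tuple  # (left, top, right, bottom) in pixels


def extract_lines(response: dict, page_width: int, page_height: int) -> list:
    """Collect LINE blocks and scale their normalized boxes to pixels."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        bb = block["Geometry"]["BoundingBox"]
        left = round(bb["Left"] * page_width)
        top = round(bb["Top"] * page_height)
        right = round((bb["Left"] + bb["Width"]) * page_width)
        bottom = round((bb["Top"] + bb["Height"]) * page_height)
        lines.append(
            DetectedLine(
                text=block.get("Text", ""),
                confidence=block["Confidence"],
                page=block.get("Page", 1),
                box=(left, top, right, bottom),
            )
        )
    return lines


# Hand-written sample in the shape of a Textract response (illustrative values).
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE", "Geometry": {"BoundingBox": {
            "Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "SSN: 123-45-6789", "Confidence": 99.5,
         "Page": 1, "Geometry": {"BoundingBox": {
             "Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}}},
    ]
}

if __name__ == "__main__":
    for line in extract_lines(SAMPLE_RESPONSE, page_width=1000, page_height=800):
        print(line)
```

The pixel boxes produced this way are the inputs a redaction step needs later in the pipeline.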

Huntington used Amazon Textract in an orchestrated process with AWS Step Functions. This approach reduced manual review time while improving accuracy in detecting sensitive information across large document volumes.

Scaling detection throughput

Automated pipelines for document processing are valuable, but processing documents sequentially would have extended the project timeline to years. To meet their goal, Huntington needed to process millions of documents each day.

Scaling to this level required addressing two main considerations: maximizing concurrent Amazon Textract jobs within service quotas, and controlling request rates to avoid throttling.

AWS service quotas are either adjustable (soft limits) or fixed (hard limits). The Amazon Textract jobs-per-second quota is adjustable, and you can request an increase through the AWS Service Quotas console.
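When sizing a quota request, a back-of-the-envelope calculation can help. The sketch below converts a daily document target into a required request rate and uses Little's law (concurrency = arrival rate × average time in system) to estimate how many jobs would be in flight at once. This is a generic sizing aid, not the method Huntington describes. The 10 million documents per day and 400 million total come from this post, while the 30-second average job duration is a hypothetical value you would replace with your own measurements.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(docs_per_day: float) -> float:
    """Average documents per second needed to hit a daily target."""
    return docs_per_day / SECONDS_PER_DAY


def estimated_concurrency(docs_per_day: float, avg_job_seconds: float) -> int:
    """Little's law: jobs in flight = arrival rate * average job duration."""
    return math.ceil(required_rate(docs_per_day) * avg_job_seconds)


def days_to_complete(total_docs: float, docs_per_day: float) -> float:
    """Elapsed days at a steady daily throughput."""
    return total_docs / docs_per_day


if __name__ == "__main__":
    daily_target = 10_000_000      # approximate rate reported in this post
    total_documents = 400_000_000  # repository size reported in this post
    avg_job_seconds = 30           # hypothetical; measure in your own account

    print(f"rate: {required_rate(daily_target):.1f} docs/second")
    print(f"in flight: {estimated_concurrency(daily_target, avg_job_seconds)} jobs")
    print(f"duration: {days_to_complete(total_documents, daily_target):.0f} days")
```

Comparing the estimated rate and concurrency against your current quotas shows whether an increase is needed before the pipeline starts.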

To maximize throughput against the service quota, Huntington used the AWS Step Functions built-in map state, which processes collections of inputs in JSON, CSV, or other formats. The team organized documents in Amazon S3 into a JSON collection and ran the map state in distributed mode for higher concurrency. To track pipeline progress, they used AWS Step Functions map run execution summaries alongside Amazon CloudWatch dashboards to monitor response times, throttle counts, successes, and error rates.
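For readers unfamiliar with the distributed map state, the following sketch builds an Amazon States Language definition for one as a Python dictionary. It assumes a JSON manifest of documents stored in Amazon S3. The bucket name, key, concurrency value, failure tolerance, and the placeholder `ProcessDocument` state are all hypothetical, since the post does not publish Huntington's actual definition.

```python
import json


def build_distributed_map(manifest_bucket: str, manifest_key: str,
                          max_concurrency: int) -> dict:
    """Build a Map state that fans out over a JSON array stored in S3."""
    return {
        "Type": "Map",
        # ItemReader tells the map state where to read its input collection.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": {"InputType": "JSON"},
            "Parameters": {"Bucket": manifest_bucket, "Key": manifest_key},
        },
        # Distributed mode runs each item as a separate child workflow execution.
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ProcessDocument",
            "States": {
                # Placeholder: a real workflow would call Amazon Textract here.
                "ProcessDocument": {"Type": "Pass", "End": True},
            },
        },
        # Upper bound on child executions running at the same time.
        "MaxConcurrency": max_concurrency,
        # Hypothetical tolerance: fail the map run if more than 5% of items fail.
        "ToleratedFailurePercentage": 5,
        "End": True,
    }


if __name__ == "__main__":
    state = build_distributed_map("example-manifest-bucket", "manifest.json", 500)
    print(json.dumps(state, indent=2))
```

`MaxConcurrency` is the setting to lower or raise when throttle counts on the dashboard indicate that the pipeline is over or under the Amazon Textract quota.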

To address potential throttling, Huntington monitored their CloudWatch dashboard to track Amazon Textract successful request counts and throttled counts. As needed, they adjusted concurrency limits for child workflow executions to stay under the Amazon Textract service quota while maintaining high throughput. When jobs completed successfully, detected fields and metadata were written to a bucket for later review. The following diagram depicts this approach:

The wait state within the state machine paused execution until the process was ready to write job metadata and continue with the next Amazon Textract invocation. When there are no failures, the state machine finishes with a pass state. When failures occur, AWS Step Functions writes to a log for human review and reprocessing.
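One way to picture a start, wait, and check loop of this kind is the child workflow sketched below, again expressed as a Python dictionary in Amazon States Language. It is an assumption-laden illustration, not Huntington's state machine: the state names, input fields (`bucket`, `key`), wait interval, and retry settings are hypothetical, and the `LogFailure` state is a placeholder for whatever logging step a real workflow would use. The helper `dangling_targets` is also hypothetical and only checks that every transition points at a defined state.

```python
def build_textract_child_workflow(wait_seconds: int = 30) -> dict:
    """Sketch of a start -> wait -> poll loop around an asynchronous Textract job."""
    sdk = "arn:aws:states:::aws-sdk:textract:"
    return {
        "StartAt": "StartTextractJob",
        "States": {
            "StartTextractJob": {
                "Type": "Task",
                "Resource": sdk + "startDocumentTextDetection",
                "Parameters": {"DocumentLocation": {"S3Object": {
                    "Bucket.$": "$.bucket", "Name.$": "$.key"}}},
                "ResultPath": "$.job",
                # Back off and retry on task failures such as throttling.
                "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                           "IntervalSeconds": 5, "MaxAttempts": 5,
                           "BackoffRate": 2.0}],
                "Catch": [{"ErrorEquals": ["States.ALL"],
                           "ResultPath": "$.error", "Next": "LogFailure"}],
                "Next": "WaitForJob",
            },
            "WaitForJob": {"Type": "Wait", "Seconds": wait_seconds,
                           "Next": "GetJobStatus"},
            "GetJobStatus": {
                "Type": "Task",
                "Resource": sdk + "getDocumentTextDetection",
                # MaxResults keeps the polled payload small; only status is needed.
                "Parameters": {"JobId.$": "$.job.JobId", "MaxResults": 1},
                "ResultPath": "$.status",
                "Next": "CheckJobStatus",
            },
            "CheckJobStatus": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.status.JobStatus",
                     "StringEquals": "SUCCEEDED", "Next": "Done"},
                    {"Variable": "$.status.JobStatus",
                     "StringEquals": "IN_PROGRESS", "Next": "WaitForJob"},
                ],
                "Default": "LogFailure",
            },
            # Placeholder for a step that records the failure for human review.
            "LogFailure": {"Type": "Pass", "End": True},
            "Done": {"Type": "Pass", "End": True},
        },
    }


def dangling_targets(definition: dict) -> set:
    """Return transition targets that are not defined as states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.add(rule["Next"])
    return targets - set(states)


if __name__ == "__main__":
    workflow = build_textract_child_workflow()
    print(sorted(workflow["States"]))
    print(dangling_targets(workflow))
```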

Redacting detected sensitive information

Up to this point, the process focused on detecting sensitive data and cataloging it within metadata files written to Amazon S3. The final steps are to redact the documents and transmit them back to on-premises storage.

Several open-source and proprietary tools support image and PDF redaction. Common open-source Python options include PyMuPDF for PDFs and image drawing libraries such as Pillow (PIL). The following figure shows a sample redaction of the invoice shown earlier. Amazon Textract supports detection of various fields, and you can also create custom classifications using regex patterns. Combined with redaction software, this lets you redact detected fields with confidence. If you want to set a threshold for human intervention, Amazon Textract provides confidence scores that can trigger validation workflows.
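The combination of regex-based classification and a confidence threshold can be sketched in a few lines of Python. The patterns, labels, and the 90.0 threshold below are hypothetical examples, not Huntington's rules. The sketch only decides what to do with a detected line of text, and the drawing of the redaction box is left to whichever redaction library you choose.

```python
import re

# Hypothetical patterns for illustration; real formats vary by institution.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b\d{10,12}\b"),
}

# Hypothetical cutoff on the 0-100 confidence score that Textract reports.
REVIEW_THRESHOLD = 90.0


def classify_line(text: str, confidence: float) -> dict:
    """Label a detected line and decide whether to redact, review, or skip it."""
    labels = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    if not labels:
        action = "skip"          # nothing sensitive matched
    elif confidence < REVIEW_THRESHOLD:
        action = "human_review"  # a match, but the OCR confidence is low
    else:
        action = "redact"
    return {"labels": labels, "action": action}


if __name__ == "__main__":
    samples = [
        ("SSN: 123-45-6789", 99.2),
        ("Acct 123456789012", 72.0),
        ("Total due", 99.0),
    ]
    for text, confidence in samples:
        print(text, classify_line(text, confidence))
```

Routing low-confidence matches to review rather than skipping them errs on the side of caution, which suits a compliance-driven workload.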

Once again, Huntington faced the same architectural challenge: how would this scale? AWS Step Functions provided the solution for processing millions of documents while offering hooks for error handling and retry logic. As the document processing pipeline cataloged objects requiring redaction, Huntington ran a simple flow against them:

To verify accuracy and thoroughness, Huntington double-checked that detected fields matched expected patterns prior to redaction, followed by a metadata update for each file. Redacted files were placed in an Amazon S3 location monitored by AWS DataSync for transmission back to on-premises file storage.

Conclusion

Using AWS, Huntington processed documents at a rate of approximately 10 million per day, reducing estimated processing time from years to just a few months. The cost of processing the entire document repository was approximately 5% of the original estimate. Redaction accuracy exceeded 95%, meeting compliance requirements and supporting data security objectives.

This project demonstrates how AWS services can support large-scale data processing and compliance initiatives. Huntington plans to continue using this framework for high-volume redaction needs such as mergers and acquisitions.

To learn more about the services used in this solution, visit the Amazon Textract detail page or explore the AWS Step Functions documentation.

Acknowledgements

Special thanks to the following individuals and teams for their contributions: Xuelei Yuan, Robert Carnell, Jeanne Keith, Debbie Montgomery, Bill Gross, Jodi Pettiford, Jon Glazer, Marshall Doss, Bob Wojasinski, Tami Wolf, Marijane Eldridge, Pradeep Kumar Tata, Michael Burkhardt, Nirmal Antony, Trevor Pease, Bryan Griffith, Angus Ferguson (AWS), Randy Patrick (AWS), Stephanie Brenneman (AWS), Art Steele, Kevin Owen.


About the authors

Rob Carnell

Rob is the Enterprise Data and Analytics Director at Huntington, overseeing cross-functional teams across AI, modeling, campaign testing and design, insights, and digital to drive integrated solutions and business impact.

Timothy Gorman

Timothy is a Lead AI Engineer at Huntington National Bank specializing in automation and unstructured data processing. He holds a doctorate in physics from The Ohio State University and has worked across disciplines including atomic physics, laser engineering, and AI-driven automation in finance.

Bobby Lumpkin

Bobby is an AI/ML Engineer at Huntington National Bank, specializing in artificial intelligence, machine learning, and advanced statistical methods in financial services. He holds a bachelor’s degree in mathematics and three master’s degrees in mathematics, mathematical sciences, and applied statistics, respectively.

Xuelei Yuan

Xuelei is a Data Science Director at Huntington, where she leads AI and machine learning initiatives, focusing on scalable, production-ready solutions powered by cloud technologies.

Ryan Doty

Ryan is a Solutions Architect Manager at Amazon Web Services (AWS), based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, he is excited by the possibilities that the cloud can bring to the world.

Angus Ferguson

Angus has been a Senior Solutions Architect with the North American Financial Services Industry team at AWS since 2022. In his role, Angus helps his customers translate business objectives into a technical vision, enabling them to grow and innovate in the cloud. Outside of AWS, Angus also has a passion for cultivating students’ interests through large events, such as hackathons, where he gets to mentor America’s next generation of computer engineers.

Randy Patrick

Randy is a Senior Technical Account Manager with the North American Financial Services Industry team at AWS. With 21 years of IT experience and a focus on cybersecurity, Randy helps enterprise customers build secure, resilient architectures that meet rigorous compliance and data protection requirements.
