architecture conceptual 1024x629 1
If you have searched for an item to buy on amazon.com, you have used Amazon Search services. At Amazon Search, we’re responsible for the search and discovery experience for our customers worldwide. In the background, we index our worldwide catalog of products, deploy highly scalable AWS fleets, and use advanced machine learning (ML) to match relevant and interesting products to every customer’s query.
Our scientists regularly train thousands of ML models to improve the quality of search results. Supporting large-scale experimentation presents its own challenges, especially when it comes to improving the productivity of the scientists training these ML models.
In this post, we share how we built a management system around Amazon SageMaker training jobs, allowing our scientists to fire-and-forget thousands of experiments and be notified when needed. They can now focus on high-value tasks and resolving algorithmic errors, saving 60% of their time.
At Amazon Search, our scientists solve information retrieval problems by experimenting and running numerous ML model training jobs on SageMaker. To keep up with our team’s innovation, our models’ complexity and number of training jobs have increased over time. SageMaker training jobs allow us to reduce the time and cost to train and tune those models at scale, without the need to manage infrastructure.
Like everything in such large-scale ML projects, training jobs can fail due to a variety of factors. This post focuses on capacity shortages and failures due to algorithm errors.
We designed an architecture with a job management system to tolerate and reduce the probability of a job failing due to capacity unavailability or algorithm errors. It allows scientists to fire-and-forget thousands of training jobs, automatically retry them on transient failure, and get notified of success or failure if needed.
In the following solution diagram, we use SageMaker training jobs as the basic unit of our solution. That is, a job represents the end-to-end training of an ML model.
The high-level workflow of this solution is as follows:
New
status in a metadata store.New
jobs from the metadata store, parses their input, and tries to launch SageMaker training jobs for each one. Their status changes to Launched
or Failed
depending on success.Completed
, Failed
, or InProgress
state in the metadata store.Completed
and Failed
jobs to the scientists.Persisting the jobs history in the metadata store also allows our team to conduct trend analysis and monitor project progress.
This job scheduling solution uses loosely coupled serverless components based on AWS Lambda, Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), and Amazon EventBridge. This ensures horizontal scalability, allowing our scientists to launch thousands of jobs with minimal operations effort. The following diagram illustrates the serverless architecture.
In the following sections, we go into more detail about each service and its components.
The ease of use and scalability of DynamoDB made it a natural choice to persist the jobs metadata in a DynamoDB table. This solution stores several attributes of jobs submitted by scientists, thereby helping with progress tracking and workflow orchestration. The most important attributes are as follows:
We use three container-based Lambda functions to orchestrate the job workflow:
New
jobs from the DynamoDB table and launches them using the SageMaker CreateTrainingJob command. It retries on transient failures, such as ResourceLimitExceeded
and CapacityError
, to instrument resiliency into the system. It then updates the job status as Launched
or Failed
depending on success.Failed
jobs from the metadata and assesses whether they should be resubmitted or marked as terminally failed. It also publishes notification messages to the scientists when their jobs reach a terminal state.We use EventBridge to run the Launch Jobs and Monitor Jobs Lambda functions on a schedule. For more information, refer to Tutorial: Schedule AWS Lambda functions using EventBridge.
Alternatively, you can use Amazon DynamoDB Streams for the triggers. For more information, see DynamoDB Streams and AWS Lambda triggers.
Our scientists are notified by email using Amazon SNS when their jobs reach a terminal state (Failed
after a maximum number of retries), Completed
, or Stopped
.
In this post, we shared how Amazon Search adds resiliency to ML model training workloads by scheduling them, and retrying them on capacity shortages or algorithm errors. We used Lambda functions in conjunction with a DynamoDB table as a central metadata store to orchestrate the entire workflow.
Such a scheduling system allows scientists to submit their jobs and forget about them. This saves time and allows them to focus on writing better models.
To go further in your learnings, you can visit Awesome SageMaker and find in a single place, all the relevant and up-to-date resources needed for working with SageMaker.
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of…
This post was co-written with Vishal Singh, Data Engineering Leader at Data & Analytics team…
At Definity, a leading Canadian P&C insurer with a history spanning over 150 years, we…
Don't expect to hear a lot about better framerates and raytracing at the Nvidia GTC…
The team working at the Social Security Administration appears to be among the largest DOGE…
Many companies invest heavily in hiring talent to create the high-performance library code that underpins…