Amazon SageMaker Training Managed Warm Pools gives you the flexibility to opt in to reuse and hold on to the underlying infrastructure for a user-defined period of time, while still handing the undifferentiated heavy lifting of managing compute instances over to Amazon SageMaker Model Training. In this post, we outline the key benefits and pain points addressed by SageMaker Training Managed Warm Pools, as well as benchmarks and best practices.
SageMaker Model Training is a fully managed capability that spins up instances for every job, trains a model, and then spins the instances down after the job. You’re billed only for the duration of the job, down to the second. This fully managed capability gives you the freedom to focus on your machine learning (ML) algorithm and not worry about undifferentiated heavy lifting like infrastructure management while training your models.
This mechanism necessitates a finite startup time for a training job. Although this startup time, also known as cold-start startup time, is fairly low, some of our most demanding customer use cases require even lower startup times, such as under 20 seconds. There are two prominent use cases that have these requirements:
For such use cases, every second spent on overhead, like the startup time for a training job, has a cumulative effect on all these jobs.
With SageMaker Training Managed Warm Pools, data scientists and ML engineers can opt in to keep SageMaker training instances or multi-instance clusters warm for a prespecified and reconfigurable time (keep_alive_period_in_seconds) after each training job completes. So even though you incur a cold-start penalty for the first training job run on an instance or cluster, the instances are already up and running for all subsequent training jobs. As a result, subsequent training jobs that start on an instance before the keep_alive_period_in_seconds expires don’t incur the cold-start startup time overhead. This can reduce training job startup times to under roughly 20 seconds (P90).
Data scientists and ML engineers can use SageMaker Training Managed Warm Pools to keep single or multiple instances warm between training runs for experimentation, or to run multiple jobs consecutively on the same single-instance or multi-instance cluster. You pay only for the duration of the training jobs and for the reconfigurable keep_alive_period_in_seconds that you specify, and, as elsewhere in SageMaker, this applies to every instance.
In essence, SageMaker Training Managed Warm Pools combine SageMaker managed instance utilization with the ability to opt in, provision capacity, and self-manage utilization for short intervals of time. You configure these intervals before a job starts, but you can also reduce or increase keep_alive_period_in_seconds while the interval is running. Increases to keep_alive_period_in_seconds can be made in intervals of up to 60 minutes, with a maximum period of 7 days for an instance or cluster.
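As a rough illustration of adjusting the interval after a job has been submitted, the sketch below builds a request for the boto3 `update_training_job` call, where the setting appears as `KeepAlivePeriodInSeconds` inside `ResourceConfig`. The helper names and the job name are hypothetical, and the 3,600-second cap is an assumption taken from the 60-minute limit described above.

```python
# Assumption: a single request may set at most 60 minutes, per the limit above.
MAX_KEEP_ALIVE_PER_REQUEST = 60 * 60


def build_keep_alive_update(job_name, new_keep_alive_seconds):
    """Build the arguments for an UpdateTrainingJob request (hypothetical helper)."""
    if not 0 <= new_keep_alive_seconds <= MAX_KEEP_ALIVE_PER_REQUEST:
        raise ValueError("keep-alive period must be between 0 and 3600 seconds")
    return {
        "TrainingJobName": job_name,
        "ResourceConfig": {"KeepAlivePeriodInSeconds": new_keep_alive_seconds},
    }


def apply_keep_alive_update(request):
    """Send the request. Needs AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("sagemaker").update_training_job(**request)


# "my-training-job" is a placeholder name, not a real job.
request = build_keep_alive_update("my-training-job", 1800)
```

Keeping the request builder separate from the AWS call makes the validation easy to check locally before anything is sent.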
To get started with warm pools, first request a warm pool quota limit increase, then specify the keep_alive_period_in_seconds parameter when starting a training job.
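A minimal sketch of the second step, assuming you submit jobs through boto3 rather than the SageMaker Python SDK: in the low-level API the setting is the `KeepAlivePeriodInSeconds` field of `ResourceConfig`. The helper names, instance type, volume size, input mode, and runtime limit below are illustrative assumptions, not recommendations from this post.

```python
def build_resource_config(instance_type, instance_count, volume_size_gb, keep_alive_seconds):
    """Return a ResourceConfig dict that opts a training job in to a warm pool."""
    # Assumption: at most 60 minutes can be requested at a time.
    if not 0 <= keep_alive_seconds <= 3600:
        raise ValueError("keep_alive_seconds must be between 0 and 3600")
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "VolumeSizeInGB": volume_size_gb,
        "KeepAlivePeriodInSeconds": keep_alive_seconds,
    }


def start_warm_pool_job(job_name, image_uri, role_arn, output_s3_uri, resource_config):
    """Submit the job. Needs AWS credentials and an approved warm pool quota."""
    import boto3  # imported lazily so the helper above works without AWS access

    client = boto3.client("sagemaker")
    return client.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={"TrainingImage": image_uri, "TrainingInputMode": "FastFile"},
        RoleArn=role_arn,
        OutputDataConfig={"S3OutputPath": output_s3_uri},
        ResourceConfig=resource_config,
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )


# Example values only: one ml.m5.xlarge instance kept warm for 30 minutes.
config = build_resource_config("ml.m5.xlarge", 1, 30, 1800)
```

With the SageMaker Python SDK, the same value is passed as the keep_alive_period_in_seconds argument named in this post.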
We performed benchmarking tests to measure job startup latency using a 1.34 GB TensorFlow image, 2 GB of data, and different training data input modes (Amazon FSx, Fast File Mode, File Mode). The tests were run across a variety of instance types from the m4, c4, m5, and c5 families in the us-east-2 Region. The startup latency was measured as the time from job creation to the start of the actual training job on the instances. The first jobs that started the cluster and created the warm pool had a startup latency of 2–3 minutes. This higher latency is due to the time taken to provision the infrastructure, download the image, and download the data. The subsequent jobs that used the warm pool cluster had a startup latency of approximately 20 seconds for Fast File Mode (FFM) or Amazon FSx, and 70 seconds for File Mode (FM). This delta is a result of FM requiring the entire dataset to be downloaded from Amazon S3 prior to the start of the job.
Your choice of training data input mode affects the startup time, even with Warm Pools. Guidance on what input mode to select is in the best practices section later in this post.
The following table summarizes the job startup latency P90 for different training data input modes.
Data Input Mode | Startup Latency P90, First Job (seconds) | Startup Latency P90, Warm Pool Jobs, Second Job Onwards (seconds) |
FSx | 136 | 19 |
Fast File Mode | 143 | 21 |
File Mode | 176 | 70 |
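To make the table easier to compare across input modes, the following short calculation derives the seconds saved and the speedup factor for each mode. It uses only the P90 figures from the table; the variable names are illustrative.

```python
# P90 startup latency in seconds: (first job, warm pool jobs), copied from the table.
p90_startup_seconds = {
    "FSx": (136, 19),
    "Fast File Mode": (143, 21),
    "File Mode": (176, 70),
}

summary = {}
for mode, (first_job, warm_job) in p90_startup_seconds.items():
    seconds_saved = first_job - warm_job
    speedup = first_job / warm_job
    summary[mode] = (seconds_saved, round(speedup, 1))
    print(f"{mode}: {seconds_saved} s saved per job, {speedup:.1f}x faster startup")
```

For these figures, warm pool jobs start about 7.2x faster with FSx, 6.8x faster with Fast File Mode, and 2.5x faster with File Mode.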
In the following section, we share some best practices when using warm pools.
Warm pools are recommended in the following scenarios:
Warm pools are not recommended when it’s unlikely that someone will reuse the warm pool before it expires. An example is a single lengthy job that runs via an automated ML pipeline.
Training jobs that reuse a warm pool start faster than the first job that created the warm pool. This is due to keeping the ML instances running between jobs with a cached training container Docker image to skip pulling the container from Amazon Elastic Container Registry (Amazon ECR). However, even when reusing a warm pool, certain initialization steps occur for all jobs. Optimizing these steps can reduce your job startup time (both first and subsequent jobs). Consider the following:
When working with a large team of data scientists, you can share warm pools that have matching job criteria, such as the same AWS Identity and Access Management (IAM) role or container image.
Let’s look at an example timeline. User-1 starts a training job that completes and results in a new warm pool being created. When user-2 starts a training job, the job reuses the existing warm pool, resulting in a fast job startup. While user-2’s job is running with the warm pool in use, if another user starts a training job, then a second warm pool is created.
This reuse behavior helps reduce costs by sharing warm pools between users that start similar jobs. If you want to avoid sharing warm pools between users, then users’ jobs must not have matching job criteria (for example, they must use a different IAM role).
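The timeline above can be traced with a toy model. This is not how SageMaker implements matching: the `ToyWarmPools` class is hypothetical, it compares only the two example criteria named here (IAM role and container image), and it ignores pool expiration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWarmPools:
    """Toy model of warm pool reuse; not SageMaker's actual matching logic."""

    pools: list = field(default_factory=list)  # each pool: {"criteria": tuple, "in_use": bool}

    def start_job(self, role_arn, image_uri):
        criteria = (role_arn, image_uri)
        # Reuse an idle pool whose criteria match, if one exists.
        for pool in self.pools:
            if pool["criteria"] == criteria and not pool["in_use"]:
                pool["in_use"] = True
                return pool, "warm start"
        # Otherwise the job starts cold and creates a new pool.
        pool = {"criteria": criteria, "in_use": True}
        self.pools.append(pool)
        return pool, "cold start"

    def finish_job(self, pool):
        # The pool stays available for reuse after the job completes.
        pool["in_use"] = False


sim = ToyWarmPools()
pool_1, user_1_start = sim.start_job("role-a", "image-a")  # user-1 creates the first pool
sim.finish_job(pool_1)
pool_2, user_2_start = sim.start_job("role-a", "image-a")  # user-2 reuses it
pool_3, user_3_start = sim.start_job("role-a", "image-a")  # pool busy, so a second pool is created
print(user_1_start, user_2_start, user_3_start, len(sim.pools))
```

In this model, giving a user a different role changes the criteria tuple, so that user's jobs never match another user's pool.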
When using warm pools for experimentation, we recommend notifying users when their job is complete. This allows users to resume experimentation before the warm pool expires or stop the warm pool if it’s no longer needed. You can also automatically trigger notifications through Amazon EventBridge.
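One possible way to wire up such notifications is sketched below: an EventBridge rule that matches SageMaker training job state changes and forwards them to a target. The rule name, target ID, and the choice of an Amazon SNS topic as the target are assumptions for illustration, and the topic would also need a policy that allows EventBridge to publish to it.

```python
import json


def training_job_finished_pattern():
    """Event pattern matching training jobs that reached a terminal state."""
    return json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Training Job State Change"],
        "detail": {"TrainingJobStatus": ["Completed", "Failed", "Stopped"]},
    })


def create_notification_rule(rule_name, sns_topic_arn):
    """Create the rule and attach the target. Needs AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the pattern can be built without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=training_job_finished_pattern(),
        State="ENABLED",
    )
    # "notify-users" is a hypothetical target ID.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify-users", "Arn": sns_topic_arn}],
    )


pattern = json.loads(training_job_finished_pattern())
```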
With warm pools, you can start a job in less than 20 seconds. Some scenarios require real-time, hands-on interactive experimentation and troubleshooting. The open-source SageMaker SSH Helper library allows you to shell into a SageMaker training container and conduct remote development and debugging.
With SageMaker Training Managed Warm Pools, you can keep your model training hardware instances warm after every job for a specified period. This can reduce the startup latency for a model training job by up to 8x. SageMaker Training Managed Warm Pools are available in all public AWS Regions where SageMaker Model Training is available.
To get started, see Train Using SageMaker Managed Warm Pools.