Today, tens of thousands of customers are building, training, and deploying machine learning (ML) models using Amazon SageMaker to power applications that have the potential to reinvent their businesses and customer experiences. These ML models have grown in size and complexity over the last few years, which has led to state-of-the-art accuracy across a range of tasks but has also pushed the time to train from days to weeks. As a result, customers must scale their models across hundreds to thousands of accelerators, which makes them more expensive to train.
SageMaker is a fully managed ML service that helps developers and data scientists easily build, train, and deploy ML models. SageMaker already provides the broadest and deepest choice of compute offerings featuring hardware accelerators for ML training, including G5 (Nvidia A10G) instances and P4d (Nvidia A100) instances.
Growing compute requirements call for faster and more cost-effective processing power. To further reduce model training times and enable ML practitioners to iterate faster, AWS has been innovating across chips, servers, and data center connectivity. The new Trn1 instances powered by AWS Trainium chips offer the best price-performance and the fastest ML model training on AWS, providing up to 50% lower cost to train deep learning models over comparable GPU-based instances without any drop in accuracy.
In this post, we show how you can maximize your performance and reduce cost using Trn1 instances with SageMaker.
SageMaker training jobs support ml.trn1 instances, powered by Trainium chips, which are purpose built for high-performance ML training applications in the cloud. You can use ml.trn1 instances on SageMaker to train natural language processing (NLP), computer vision, and recommender models across a broad set of applications, such as speech recognition, recommendation, fraud detection, image and video classification, and forecasting. The ml.trn1 instances feature up to 16 Trainium chips; Trainium is the second-generation ML chip built by AWS, after AWS Inferentia. ml.trn1 instances are the first Amazon Elastic Compute Cloud (Amazon EC2) instances with up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth. For efficient data and model parallelism, each ml.trn1.32xlarge instance has 512 GB of high-bandwidth memory, delivers up to 3.4 petaflops of FP16/BF16 compute power, and features NeuronLink, an intra-instance, high-bandwidth, nonblocking interconnect.
Trainium is available in two configurations and can be used in the US East (N. Virginia) and US West (Oregon) Regions.
The following table summarizes the features of the Trn1 instances.
Instance Size | Trainium Accelerators | Accelerator Memory (GB) | vCPUs | Instance Memory (GiB) | Network Bandwidth (Gbps) | EFA and RDMA Support |
trn1.2xlarge | 1 | 32 | 8 | 32 | Up to 12.5 | No |
trn1.32xlarge | 16 | 512 | 128 | 512 | 800 | Yes |
trn1n.32xlarge (coming soon) | 16 | 512 | 128 | 512 | 1600 | Yes |
Let’s see how to use Trainium with SageMaker through a simple example. We train a text classification model with SageMaker training and PyTorch, using the Hugging Face Transformers library.
We use the Amazon Reviews dataset, which consists of reviews from amazon.com. The data spans a period of 18 years, comprising approximately 35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. The following code is an example from the AmazonPolarity test set:
For this post, we only use the content and label fields. The content field is a free text review, and the label field is a binary value containing 1 or 0 for positive or negative reviews, respectively.
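As a rough illustration of the field selection described above, the following Python sketch keeps only the content and label fields of a record and checks that the label is binary. The sample record is hypothetical and written only for this sketch (it is not taken from the dataset), the extra `title` field is an assumption about the record layout, and the helper name `to_training_example` is ours.

```python
from typing import Any, Dict

# Label convention stated in this post: 1 = positive review, 0 = negative review.
LABEL_NAMES = {0: "negative", 1: "positive"}


def to_training_example(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the two fields used in this post: content and label."""
    label = int(record["label"])
    if label not in LABEL_NAMES:
        raise ValueError(f"expected a binary label (0 or 1), got {label!r}")
    return {"content": str(record["content"]), "label": label}


# Hypothetical record for illustration only; the "title" field is an assumption.
sample_record = {
    "title": "Example title",
    "content": "Example review text.",
    "label": 1,
}

example = to_training_example(sample_record)
print(example["content"], "->", LABEL_NAMES[example["label"]])
```

Any other fields a record carries are simply dropped, so the training script only ever sees the review text and its binary label.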
For our algorithm, we use BERT, a transformer model pre-trained on a large corpus of English data in a self-supervised fashion. This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.
Let’s begin by taking a closer look at the different components involved in training the model:
Let’s look at the code changes needed to adapt a regular GPU-based PyTorch script to run on Trainium. At a high level, we need to make the following changes:
The entire example, which trains a text classification model using SageMaker and Trainium, is available in the following GitHub repo. The notebook file Fine tune Transformers for building classification models using SageMaker and Trainium.ipynb is the entrypoint and contains step-by-step instructions to run the training.
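For orientation, launching a Trainium training job from a notebook might look roughly like the following sketch with the SageMaker Python SDK. This is not the code from the repo. The script name `train.py`, the `scripts` directory, the hyperparameter names and values, and the `framework_version` and `py_version` values are all assumptions made for illustration; check them against the notebook and the SDK documentation before use.

```python
from typing import Any, Dict


def build_estimator_kwargs(role_arn: str, instance_count: int = 1) -> Dict[str, Any]:
    """Assemble keyword arguments for a SageMaker PyTorch estimator on Trainium."""
    return {
        "entry_point": "train.py",            # hypothetical training script name
        "source_dir": "scripts",              # hypothetical directory holding the script
        "role": role_arn,                     # IAM role that SageMaker assumes for the job
        "instance_type": "ml.trn1.32xlarge",  # Trainium instance type used in this post
        "instance_count": instance_count,
        "framework_version": "1.11.0",        # assumption: a Trainium-enabled PyTorch version
        "py_version": "py38",                 # assumption: matching Python version
        # Ask SageMaker to start the script with torchrun across the accelerators.
        "distribution": {"torch_distributed": {"enabled": True}},
        # Hypothetical hyperparameters, passed to the script as command-line arguments.
        "hyperparameters": {"epochs": 1, "train_batch_size": 8},
    }


def launch_training(role_arn: str, train_s3_uri: str):
    """Start the training job. Needs the sagemaker package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily so the sketch loads without the SDK

    estimator = PyTorch(**build_estimator_kwargs(role_arn))
    estimator.fit({"train": train_s3_uri})  # "train" is the input channel name
    return estimator
```

Keeping the arguments in a plain dictionary makes it easy to run the same job on a GPU instance for comparison by changing only `instance_type`.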
In the test, we ran two training jobs: one on ml.trn1.32xlarge, and one on ml.p4d.24xlarge, with the same batch size, training data, and other hyperparameters. During the training jobs, we measured the billable time of the SageMaker training jobs, and calculated the price-performance by multiplying the time required to run each training job in hours by the price per hour for the instance type. We selected the best result for each instance type out of multiple job runs.
The following table summarizes our benchmark findings.
Model | Instance Type | Price (USD per node-hour) | Throughput (iterations/sec) | Validation Accuracy | Billable Time (sec) | Training Cost (USD) |
BERT base classification | ml.trn1.32xlarge | 24.725 | 6.64 | 0.984 | 6033 | 41.47 |
BERT base classification | ml.p4d.24xlarge | 37.69 | 5.44 | 0.984 | 6553 | 68.6 |
The results showed that the Trainium instance costs less than the P4d instance while providing comparable or better throughput and the same accuracy when training the same model with the same input data and training parameters. This means that the Trainium instance delivers better price-performance than the GPU-based P4d instance. In this simple example, Trainium delivered about 22% higher throughput (6.64 vs. 5.44 iterations/sec), about 8% less billable time (6033 vs. 6553 seconds), and about 40% lower training cost ($41.47 vs. $68.60) than the P4d instance.
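The price-performance arithmetic can be reproduced with a short Python sketch. It uses only the price, billable time, and throughput figures from the benchmark table; the function and variable names are ours. The recomputed Trainium cost comes out a few cents below the table's figure.

```python
# Figures copied from the benchmark table in this post.
RUNS = {
    "ml.trn1.32xlarge": {"price_per_hour": 24.725, "billable_seconds": 6033, "iterations_per_sec": 6.64},
    "ml.p4d.24xlarge": {"price_per_hour": 37.69, "billable_seconds": 6553, "iterations_per_sec": 5.44},
}


def training_cost(price_per_hour: float, billable_seconds: float) -> float:
    """Training cost in USD: billable time in hours times the hourly instance price."""
    return price_per_hour * billable_seconds / 3600.0


trn1 = RUNS["ml.trn1.32xlarge"]
p4d = RUNS["ml.p4d.24xlarge"]

trn1_cost = training_cost(trn1["price_per_hour"], trn1["billable_seconds"])
p4d_cost = training_cost(p4d["price_per_hour"], p4d["billable_seconds"])

# Relative differences, with the P4d run as the baseline.
cost_savings = 1.0 - trn1_cost / p4d_cost
throughput_gain = trn1["iterations_per_sec"] / p4d["iterations_per_sec"] - 1.0
time_reduction = 1.0 - trn1["billable_seconds"] / p4d["billable_seconds"]

print(f"Trn1 cost: ${trn1_cost:.2f}, P4d cost: ${p4d_cost:.2f}")
print(f"Cost savings: {cost_savings:.1%}")
print(f"Throughput gain: {throughput_gain:.1%}")
print(f"Billable time reduction: {time_reduction:.1%}")
```

Most of the cost difference in this run comes from the lower hourly price rather than from the shorter billable time.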
After we train the model, we can deploy it to various instance types such as CPU, GPU, or AWS Inferentia. The key point to note is that the trained model isn’t dependent on specialized hardware for deployment and inference. SageMaker provides mechanisms to deploy a trained model for either real-time or batch inference. The notebook example in the GitHub repo contains code to deploy the trained model as a real-time endpoint using an ml.c5.xlarge (CPU-based) instance.
In this post, we looked at how to use Trainium and SageMaker to quickly set up and train a classification model that gives up to 50% cost savings without compromising on accuracy. You can use Trainium for a wide range of use cases that involve pre-training or fine-tuning Transformer-based models. For more information about support of various model architectures, refer to Model Architecture Fit Guidelines.