In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.
The following are benefits of using NVIDIA NeMo for distributed training:
You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.
Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.
The following diagram shows the solution architecture.
In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:
You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs:
These steps may change if you are on a non-Linux platform. Consult the preceding documentation for installing the CLIs on other platforms accordingly. We also require that you have a capacity reservation with p4de.24xlarge instances and have the capacityReservationID.
The p4de.24xlarge instances have NVIDIA A100 80 GB GPUs, which are highly popular for distributed training of generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.
The following are key points to note when creating this cluster:
Fill in the capacityReservationID field and make sure to specify the availabilityZones within the managedNodeGroups section, which should be the same Availability Zone ID in which your capacity lives.
The cluster has two managed node groups: one on c5.2xlarge instances and another for running distributed training on p4de.24xlarge instances. Managed node groups will use Amazon EKS optimized AMIs. If you want to provide a custom AMI, you can create a self-managed node group and specify a custom AMI. To find the AMI ID, refer to Retrieving Amazon EKS optimized Amazon Linux AMI IDs. For more details about the Amazon EKS optimized AMI, see the GitHub repo.
Make sure efaEnabled is set to true. You can use the same config for creating a cluster with other node groups. For a list of EFA supported instance types, see Supported instance types.
Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating the high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.
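The key points about the cluster configuration listed earlier in this section can be sketched in Python as a small builder for an eksctl ClusterConfig. This is a minimal illustration, not the configuration from the GitHub repo: the node group names, the system node count, and the example cluster name, Region, zone, and reservation ID are all hypothetical, and the field layout follows the eksctl config schema as we understand it, so check it against the eksctl documentation before use.

```python
import json


def build_cluster_config(name, region, zone, capacity_reservation_id):
    """Return an eksctl ClusterConfig as a dict; all argument values are placeholders."""
    system_group = {
        "name": "system",  # hypothetical node group name
        "instanceType": "c5.2xlarge",
        "desiredCapacity": 1,  # hypothetical count
    }
    gpu_group = {
        "name": "p4de-training",  # hypothetical node group name
        "instanceType": "p4de.24xlarge",
        "desiredCapacity": 2,
        # Must be the zone in which the capacity reservation lives.
        "availabilityZones": [zone],
        # Required so the nodes are launched with EFA interfaces.
        "efaEnabled": True,
        "capacityReservation": {
            "capacityReservationTarget": {
                "capacityReservationID": capacity_reservation_id
            }
        },
    }
    return {
        "apiVersion": "eksctl.io/v1alpha5",
        "kind": "ClusterConfig",
        "metadata": {"name": name, "region": region},
        "managedNodeGroups": [system_group, gpu_group],
    }


# Example values only; replace them with your own cluster details.
config = build_cluster_config(
    "nemo-eks", "us-east-1", "us-east-1a", "cr-0123456789abcdef0"
)
print(json.dumps(config, indent=2))
```

Building the config in code makes it easy to assert the two settings that are most often missed, the zone of the reservation and efaEnabled, before the cluster is created.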
The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.
Run kubectl get nodes to verify the nodes. Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.
The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.
The preceding command describes the node in detail. To make sure EFA is installed correctly, check that you see details as shown in the following screenshot.
For p4 nodes, you will see vpc.amazonaws.com/efa:4, and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.
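The EFA check described above can also be automated. The following is a minimal sketch, assuming the node list comes from kubectl get nodes -o json and that nodes carry the standard node.kubernetes.io/instance-type label. The function name and the sample node are hypothetical; the expected counts are the ones stated above.

```python
# Expected EFA device counts per instance type, as stated in this post.
EXPECTED_EFA = {"p4de.24xlarge": 4, "p5.48xlarge": 32}
EFA_RESOURCE = "vpc.amazonaws.com/efa"


def efa_mismatches(node_list):
    """Return (node name, instance type, expected, found) for nodes with a wrong EFA count."""
    problems = []
    for node in node_list.get("items", []):
        labels = node["metadata"].get("labels", {})
        instance_type = labels.get("node.kubernetes.io/instance-type")
        expected = EXPECTED_EFA.get(instance_type)
        if expected is None:
            continue  # not an instance type we have an expectation for
        allocatable = node.get("status", {}).get("allocatable", {})
        found = int(allocatable.get(EFA_RESOURCE, 0))
        if found != expected:
            problems.append((node["metadata"]["name"], instance_type, expected, found))
    return problems


# Hypothetical sample shaped like the output of `kubectl get nodes -o json`.
sample = {
    "items": [
        {
            "metadata": {
                "name": "node-a",
                "labels": {"node.kubernetes.io/instance-type": "p4de.24xlarge"},
            },
            "status": {"allocatable": {EFA_RESOURCE: "4"}},
        }
    ]
}
print(efa_mismatches(sample))
```

An empty result means every recognized GPU node advertises the expected number of EFA devices. A node that reports zero usually points to the EFA device plugin not running on that node.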
If EFA is enabled in the node group, make sure the security group attached to the nodes has a rule that allows all outgoing traffic originating from the same security group. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.
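As an illustration of the self-referencing rule, the following sketch builds the request parameters for a rule that allows all traffic to the same security group. It only constructs a dictionary and calls no AWS API. The function name and the security group ID are hypothetical, and the parameter shape is our understanding of what the boto3 EC2 client expects for authorize_security_group_egress.

```python
def self_referencing_rule(security_group_id):
    """Build parameters for an all-traffic rule whose peer is the group itself."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "-1",  # -1 means all protocols and all ports
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }


# Hypothetical security group ID.
params = self_referencing_rule("sg-0123456789abcdef0")
print(params)
```

With boto3 installed and credentials configured, these parameters could be passed as keyword arguments to the EC2 client's authorize_security_group_egress call. The AWS EFA documentation also calls for a matching inbound rule from the same security group, which takes the same parameter shape through authorize_security_group_ingress.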
For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.
To have an FSx for Lustre file system mounted on your EKS cluster, complete the following steps:
Provide the subnet for the file system (FSX_SUBNET_ID), the VPC of Amazon EKS (VPC_ID), and the security group you created (SECURITY_GROUP_ID).
Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.
You can check to make sure that the volumes are in Bound state.
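The Bound check can be scripted as well. The following is a minimal sketch, assuming the input is the parsed output of kubectl get pvc -o json; the function name and the sample claim names are hypothetical.

```python
def unbound_claims(pvc_list):
    """Return the names of persistent volume claims whose phase is not Bound."""
    return [
        item["metadata"]["name"]
        for item in pvc_list.get("items", [])
        if item.get("status", {}).get("phase") != "Bound"
    ]


# Hypothetical sample shaped like the output of `kubectl get pvc -o json`.
sample = {
    "items": [
        {"metadata": {"name": "fsx-pvc"}, "status": {"phase": "Bound"}},
        {"metadata": {"name": "scratch-pvc"}, "status": {"phase": "Pending"}},
    ]
}
print(unbound_claims(sample))
```

A claim that stays in Pending is worth inspecting with kubectl describe pvc before launching any job that mounts it.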
For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:
To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:
Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:
The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:
You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:
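As a rough sketch of what building and pushing involves, the following Python helper assembles the Amazon ECR image URI and the usual login, build, and push commands as strings. It does not run anything. The account ID, Region, repository name, tag, and Dockerfile name are hypothetical placeholders, and the helper names are ours, not part of any AWS tool.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Return (registry host, full image URI) for a private Amazon ECR repository."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return registry, f"{registry}/{repository}:{tag}"


def build_and_push_commands(account_id, region, repository, tag, dockerfile="Dockerfile"):
    """Return the shell commands to log in to ECR, build the image, and push it."""
    registry, image = ecr_image_uri(account_id, region, repository, tag)
    return [
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {image} -f {dockerfile} .",
        f"docker push {image}",
    ]


# Hypothetical values; replace them with your own account and repository.
for command in build_and_push_commands("123456789012", "us-east-1", "nemo", "24.01"):
    print(command)
```

The repository has to exist before the push, for example by creating it once with aws ecr create-repository.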
The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:
In a Slurm cluster implementation, the launcher scripts, data, and results folder could reside in the file system that both the head node (the node from which jobs are submitted) and the compute nodes access. But in this Amazon EKS implementation, the node that you used to create the EKS cluster doesn’t have access to the EKS file system. To get around this, you can put the launcher scripts on the head node and the results and data folders in the file system that the compute nodes have access to.
We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.
The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case. Next, we copy the following folders from the container to the /fsx-shared/data folder:
NeMo-Megatron-Launcher/launcher_scripts/data/bpe
NeMo-Megatron-Launcher/launcher_scripts/data/nsfw
To copy the files, use fsx-share-test.yaml.
A few files need to be updated for data preparation to work with the EKS cluster.
For cluster, use k8s.
For training, use gpt3/126m.
For stages, use data_preparation and no other stages.
For launcher_scripts_path, use the path to the NeMo Megatron launch scripts, which should end with /launcher_scripts.
For data_dir, use /fsx-shared/data (the location to store and read the data).
For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
For container, use ${REPOSITORY}${IMAGE}${TAG}.
Set node_array_size to 2.
Set file_numbers to "0-5". With five files, it should be around 350 GB of data.
If you get an error that mpirun is not found, add the full path to the executable, /opt/amazon/openmpi/bin/mpirun.
Add /fsx-shared in the container volume mount path.
Run python3 main.py. This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.
You can monitor the job with helm list, kubectl get pods, and kubectl logs --follow.
When the job is finished, remove the Helm chart with helm uninstall download-gpt3-pile.
You can see the downloaded data in the /fsx-shared folder by opening a shell in one of the pods with kubectl exec -it nlp-worker-0 -- bash.
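Because a wrong path or an extra stage only surfaces after the Helm chart is launched, it can help to check the launcher settings first. The following is a minimal sketch of such a pre-flight check. It assumes the launcher config has already been loaded into a plain dictionary with stages, launcher_scripts_path, data_dir, and base_results_dir as top-level keys, which is our assumption about the file layout; the function name and the example path are hypothetical.

```python
def check_launcher_config(cfg, stage):
    """Return a list of problems found in a launcher config dict for one stage."""
    problems = []
    if cfg.get("stages") != [stage]:
        problems.append(f"stages should contain only {stage}")
    scripts_path = str(cfg.get("launcher_scripts_path", "")).rstrip("/")
    if not scripts_path.endswith("/launcher_scripts"):
        problems.append("launcher_scripts_path should end with /launcher_scripts")
    if cfg.get("data_dir") != "/fsx-shared/data":
        problems.append("data_dir should be /fsx-shared/data")
    if cfg.get("base_results_dir") != "/fsx-shared/results":
        problems.append("base_results_dir should be /fsx-shared/results")
    return problems


# Hypothetical config values for the data preparation stage.
config = {
    "stages": ["data_preparation"],
    "launcher_scripts_path": "/home/ec2-user/NeMo-Megatron-Launcher/launcher_scripts",
    "data_dir": "/fsx-shared/data",
    "base_results_dir": "/fsx-shared/results",
}
print(check_launcher_config(config, "data_preparation"))
```

The same function can be reused for the training stage by passing training as the stage name.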
Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:
Modify the conf/config.yaml file:
Set stages to training and no other stages.
Modify the conf/training/gpt3/126m.yaml file:
Set num_nodes to 2.
Set devices to 1.
Change use_distributed_sampler: False to replace_sampler_ddp: False.
Optionally, if you want to use a mock dataset instead of a real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.
Modify the nemo_launcher/core/k8s_templates/training/training.yaml file.
Run python3 main.py to start training. You should see the training pods by running kubectl get pods. In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs --follow, you can also open a shell in your pod with kubectl exec and use nvidia-smi to check GPU status.
When the job is finished, remove the Helm chart with helm uninstall gpt3-126m.
Model checkpoints are saved at /fsx-shared/results/checkpoints along with other training logs and TensorBoard events. By default, checkpoints are saved every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.
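The edits to conf/training/gpt3/126m.yaml described in the training steps can be summarized as a small transformation on the parsed config. This is a sketch under assumptions: it treats the config as a plain dictionary, and it assumes num_nodes, devices, and the sampler flag live under a trainer section and the dataset settings under model.data, which is our reading of the file layout rather than something this post states. The function name and sample values are hypothetical.

```python
import copy


def apply_training_edits(cfg, mock_data=False):
    """Return a copy of a training config dict with the edits from this post applied."""
    new_cfg = copy.deepcopy(cfg)
    trainer = new_cfg.setdefault("trainer", {})
    trainer["num_nodes"] = 2
    trainer["devices"] = 1
    # Rename the sampler flag, keeping its value (False in this post).
    if "use_distributed_sampler" in trainer:
        trainer["replace_sampler_ddp"] = trainer.pop("use_distributed_sampler")
    if mock_data:
        data = new_cfg.setdefault("model", {}).setdefault("data", {})
        data["data_impl"] = "mock"
        data["data_prefix"] = []
    return new_cfg


# Hypothetical starting values.
original = {
    "trainer": {"num_nodes": 8, "devices": 8, "use_distributed_sampler": False},
    "model": {"data": {"data_impl": "mmap", "data_prefix": ["1.0", "/data/file"]}},
}
print(apply_training_edits(original, mock_data=True))
```

Working on a copy keeps the original settings available, which is convenient when switching between the mock dataset and the real one.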
If deployment fails due to incorrect setup or configuration, complete the following debug steps:
Inspect the failing pod with kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
Remove the Helm chart with helm uninstall CHARTNAME. Pods should be spun down after removing the Helm chart.
Confirm with kubectl get pods.
If pods remain, delete them with kubectl delete pod PODNAME.
Based on the error message, you may find errors from:
Check that the kubectl get pods -A output looks like that shown earlier. If errors exist, try reinstalling Operators and CRDs.
It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.
Not only will this delete the persistent volume, it will also delete the FSx for Lustre file system, and all the data on the file system will be lost.
This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.
In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.
To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.