We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.
FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.
Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and managed Kubernetes control plane on the EKS cluster.
AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:
“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce time we spent for undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”
– Observea
“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”
– Articul8 AI
This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.
The post is organized into the following three sections:
This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.
Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). You have three virtual private clouds (VPCs) in this architecture, hosting different types of resources:
Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services on your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.
The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.
Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:
In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are critical for managing a large cluster and running large-scale training jobs on them as part of the Amazon EKS integration:
.yaml
file and manage jobs (list, describe, view, cancel) without needing to use kubectl
. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements by up to 20%. You can also enable SageMaker HyperPod compute with Amazon EKS support using third-party tools like KubeRay, which runs on the Kubernetes API. This allows you to bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.
You need to have the following in place prior to the HyperPod compute deployment:
With the aforementioned resources successfully deployed, you’re now prepared to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:
The provided configuration file contains two key highlights:
You can create a HyperPod compute with the following aws command (you need version 2.17.47 or newer):
To verify the cluster status, you can use the following command:
This command displays the cluster details, including the cluster name, status, and creation time:
Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
To gain further insight into the instances, you can use kubectl get nodes
and examine the node labels. The sagemaker.amazonaws.com/node-health-status
label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge
instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge
instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:
The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
. The log streams are logged at DeepHealthCheckResults/<log_stream_id>
. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:
You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check
label on each node:
amazonaws.com/deep-health-check: InProgress
amazonaws.com/deep-health-check: Passed
amazonaws.com/deep-health-check: Failed
If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:
sagemaker.amazonaws.com/node-health-status: Schedulable
When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.
For complete list of resilience-related Kubernetes labels, please refer AWS documentation.
Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch stream log:
/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>
The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:
The deep health checks or the health monitor agent identify issues in a certain node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule
to avoid scheduling pods, and then the node is replaced or rebooted.
You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.
You can find the Container Insights set up guide in Amazon EKS Support in Amazon SageMaker HyperPod Workshop.
In addition to infrastructure resiliency features, you can use the use job auto resume capability using the Kubeflow Training Operator for PyTorch to maintain the recovery and continuation of training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example on the awsome-distributed-training repository.
To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations
and nodeSelector
:
With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true"
and sagemaker.amazonaws.com/job-max-retry-count: "2"
, SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable
, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
Submit the PyTorchJob using the kubectl
command:
With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:
In the event of a hardware failure, the Kubeflow training job restarts as follows:
When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:
Refer to config.yaml for full configuration. For other CLI options, refer to the documentation on Github repository.
To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:
Cluster deletion can take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker console.
With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface in SageMaker HyperPod. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS further enhances this capability by running deep health checks. Whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training processes in the event of unexpected interruptions, involving node replacement and job resubmission.
For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.
This post is co-written with Steven Craig from Hearst. To maintain their competitive edge, organizations…
Conspiracy theories about missing votes—which are not, in fact, missing—and something being “not right” are…
Researchers have developed AI-driven mobile robots that can carry out chemical synthesis research with extraordinary…
In recent years, roboticists have introduced robotic systems that can complete missions in various environments,…
Overwhelmed by manual tasks and data overload? Streamline your business and boost revenue with the…
In real life, the machine learning model is not a standalone object that only produces…