In this post, we demonstrate how Kubeflow on AWS (an AWS-specific distribution of Kubeflow) used with AWS Deep Learning Containers and Amazon Elastic File System (Amazon EFS) simplifies collaboration and provides flexibility in training deep learning models at scale on both Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon SageMaker utilizing a hybrid architecture approach.
Machine learning (ML) development relies on complex and continuously evolving open-source frameworks and toolkits, as well as complex and continuously evolving hardware ecosystems. This poses a challenge when scaling out ML development to a cluster. Containers offer a solution, because they can fully encapsulate not just the training code, but the entire dependency stack down to the hardware libraries. This ensures an ML environment that is consistent and portable, and facilitates reproducibility of the training environment on each individual node of the training cluster.
Kubernetes is a widely adopted system for automating infrastructure deployment, resource scaling, and management of these containerized applications. However, Kubernetes wasn’t built with ML in mind, so it can feel counterintuitive to data scientists due to its heavy reliance on YAML specification files. There isn’t a Jupyter experience, and there aren’t many ML-specific capabilities, such as workflow management and pipelines, and other capabilities that ML experts expect, such as hyperparameter tuning, model hosting, and others. Such capabilities can be built, but Kubernetes wasn’t designed to do this as its primary objective.
The open-source community took notice and developed a layer on top of Kubernetes called Kubeflow. Kubeflow aims to make the deployment of end-to-end ML workflows on Kubernetes simple, portable, and scalable. You can use Kubeflow to deploy best-of-breed open-source systems for ML to diverse infrastructures.
Kubeflow and Kubernetes provide flexibility and control to data science teams. However, ensuring high utilization of training clusters running at scale with reduced operational overhead is still challenging.
This post demonstrates how customers who have on-premises restrictions or existing Kubernetes investments can address this challenge by using Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training based on a self-managed approach, and by using fully managed SageMaker for a cost-optimized, production-scale training infrastructure. This includes a step-by-step implementation of a hybrid distributed training architecture that allows you to choose between the two approaches at runtime, giving you maximum control and flexibility to meet stringent deployment needs. You will see how you can continue using open-source libraries in your deep learning training script and still run it on both Kubernetes and SageMaker in a platform-agnostic way.
Neural network models built with deep learning frameworks like TensorFlow, PyTorch, MXNet, and others provide much higher accuracy by using significantly larger training datasets, especially in computer vision and natural language processing use cases. However, with large training datasets, it takes longer to train the deep learning models, which ultimately slows down the time to market. If we could scale out a cluster and bring down the model training time from weeks to days or hours, it could have a huge impact on productivity and business velocity.
Amazon EKS helps provision the managed Kubernetes control plane. You can use Amazon EKS to create large-scale training clusters with CPU and GPU instances and use the Kubeflow toolkit to provide ML-friendly, open-source tools and operationalize ML workflows that are portable and scalable using Kubeflow Pipelines to improve your team’s productivity and reduce the time to market.
However, there could be a couple of challenges with this approach:
Kubeflow on AWS helps address these challenges and provides an enterprise-grade semi-managed Kubeflow product. With Kubeflow on AWS, you can replace some Kubeflow control plane services like database, storage, monitoring, and user management with AWS managed services like Amazon Relational Database Service (Amazon RDS), Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx, Amazon CloudWatch, and Amazon Cognito.
Replacing these Kubeflow components decouples critical parts of the Kubeflow control plane from Kubernetes, providing a secure, scalable, resilient, and cost-optimized design. This approach also frees up storage and compute resources from the EKS data plane, which may be needed by applications such as distributed model training or user notebook servers. Kubeflow on AWS also provides native integration of Jupyter notebooks with Deep Learning Container (DLC) images, which are pre-packaged and preconfigured with AWS optimized deep learning frameworks such as PyTorch and TensorFlow that allow you to start writing your training code right away without dealing with dependency resolutions and framework optimizations. Also, Amazon EFS integration with training clusters and the development environment allows you to share your code and processed training dataset, which avoids building the container image and loading huge datasets after every code change. These integrations with Kubeflow on AWS help you speed up the model building and training time and allow for better collaboration with easier data and code sharing.
Kubeflow on AWS helps build a highly available and robust ML platform. This platform provides flexibility to build and train deep learning models and provides access to many open-source toolkits, insights into logs, and interactive debugging for experimentation. However, achieving maximum utilization of infrastructure resources while training deep learning models on hundreds of GPUs still involves a lot of operational overhead. This can be addressed by using SageMaker, which is a fully managed service designed and optimized for handling performant and cost-optimized training clusters that are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing close to 100% resource utilization. You can integrate SageMaker with Kubeflow Pipelines using managed SageMaker components. This allows you to operationalize ML workflows as part of Kubeflow pipelines, where you can use Kubernetes for local training and SageMaker for production-scale training in a hybrid architecture.
The following architecture describes how we use Kubeflow Pipelines to build and deploy portable and scalable end-to-end ML workflows to conditionally run distributed training on Kubernetes using Kubeflow training or SageMaker based on the runtime parameter.
Kubeflow training is a group of Kubernetes Operators that add to Kubeflow the support for distributed training of ML models using different frameworks like TensorFlow, PyTorch, and others. pytorch-operator is the Kubeflow implementation of the Kubernetes custom resource (PyTorchJob) to run distributed PyTorch training jobs on Kubernetes.
We use the PyTorchJob Launcher component as part of the Kubeflow pipeline to run PyTorch distributed training during the experimentation phase when we need flexibility and access to all the underlying resources for interactive debugging and analysis.
We also use SageMaker components for Kubeflow Pipelines to run our model training at production scale. This allows us to take advantage of powerful SageMaker features such as fully managed services, distributed training jobs with maximum GPU utilization, and cost-effective training through Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.
As part of the workflow creation process, you complete the following steps (as shown in the preceding diagram) to create this pipeline:
The following figure shows the Kubeflow Pipelines components involved in the architecture that give us the flexibility to choose between Kubernetes or SageMaker distributed environments.
We use the following step-by-step approach to install and run the use case for distributed training on Amazon EKS and SageMaker with Kubeflow on AWS.
For this walkthrough, you should have the following prerequisites:
An IAM role named sagemakerrole. Add the managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker access to S3 buckets. This role is used by the SageMaker job submitted as part of the Kubeflow Pipelines step.
The service quota for ml.p3.2xlarge increased to 2 using the Service Quotas console.
You can use several different approaches to build a Kubernetes cluster and deploy Kubeflow. In this post, we focus on an approach that we believe brings simplicity to the process. First, we create an EKS cluster, then we deploy Kubeflow on AWS v1.5 on it. For each of these tasks, we use a corresponding open-source project that follows the principles of the Do Framework. Rather than installing a set of prerequisites for each task, we build Docker containers that have all the necessary tools and perform the tasks from within the containers.
We use the Do Framework in this post, which automates the Kubeflow deployment with Amazon EFS as an add-on. For the official Kubeflow on AWS deployment options for production deployments, refer to Deployment.
We configure a working directory so we can refer to it as the starting point for the steps that follow:
We also configure an AWS CLI profile. To do so, you need an access key ID and secret access key of an AWS Identity and Access Management (IAM) user account with administrative privileges (attach the existing managed policy) and programmatic access. See the following code:
If you already have an EKS cluster available, you can skip to the next section. For this post, we use the aws-do-eks project to create our cluster.
aws-do-eks container:
The build.sh script creates a Docker container image that has all the necessary tools and scripts for provisioning and operation of EKS clusters. The run.sh script starts a container using the created Docker image and keeps it up, so we can use it as our EKS management environment. To see the status of your aws-do-eks container, you can run ./status.sh. If the container is in Exited status, you can use the ./start.sh script to bring the container up, or to restart the container, you can run ./stop.sh followed by ./run.sh.
aws-do-eks container:
By default, this configuration creates a cluster named eks-kubeflow in the us-west-2 Region with six m5.xlarge nodes. Also, EBS volume encryption is not enabled by default. You can enable it by adding "volumeEncrypted: true" to the node group, and the volumes are then encrypted using the default key. Modify other configuration settings if needed.
The cluster provisioning process may take up to 30 minutes.
The output from the preceding command for a cluster that was created successfully looks like the following code:
In this use case, you speed up the SageMaker training job by training deep learning models from data already stored in Amazon EFS. This choice has the benefit of directly launching your training jobs from the data in Amazon EFS with no data movement required, resulting in faster training start times.
We create an EFS volume and deploy the EFS Container Storage Interface (CSI) driver. This is accomplished by a deployment script located in /eks/deployment/csi/efs within the aws-do-eks container.
This script assumes you have one EKS cluster in your account. Set CLUSTER_NAME=<eks_cluster_name> in case you have more than one EKS cluster.
This script provisions an EFS volume and creates mount targets for the subnets of the cluster VPC. It then deploys the EFS CSI driver and creates the efs-sc storage class and efs-pv persistent volume in the EKS cluster.
Upon successful completion of the script, you should see output like the following:
You use a private VPC that your SageMaker training job and EFS file system have access to. To give the SageMaker training cluster access to the S3 buckets from your private VPC, you create a VPC endpoint:
You may now exit the aws-do-eks container shell and proceed to the next section:
To deploy Kubeflow on Amazon EKS, we use the aws-do-kubeflow project.
This script opens the project configuration file in a text editor. It’s important for AWS_REGION to be set to the Region your cluster is in, as well as AWS_CLUSTER_NAME to match the name of the cluster that you created earlier. By default, your configuration is already properly set, so if you don’t need to make any changes, just close the editor.
The build.sh script creates a Docker container image that has all the tools necessary to deploy and manage Kubeflow on an existing Kubernetes cluster. The run.sh script starts a container, using the Docker image, and the exec.sh script opens a command shell into the container, which we can use as our Kubeflow management environment. You can use the ./status.sh script to see if the aws-do-kubeflow container is up and running and the ./stop.sh and ./run.sh scripts to restart it as needed.
aws-do-eks container, you can verify that the configured cluster context is as expected:
deploy.sh script:
The deployment is successful when all pods in the kubeflow namespace enter the Running state. A typical output looks like the following code:
You should see output that looks like the following code:
This command port-forwards the Istio ingress gateway service from your cluster to your local port 8080. To access the Kubeflow dashboard, visit http://localhost:8080 and log in using the default user credentials (user@example.com/12341234). If you’re running the aws-do-kubeflow container in AWS Cloud9, then you can choose Preview, then choose Preview Running Application. If you’re running on Docker Desktop, you may need to run the ./kubeflow-expose.sh script outside of the aws-do-kubeflow container.
To set up your Kubeflow on AWS environment, we create an EFS volume and a Jupyter notebook.
To create an EFS volume, complete the following steps:
efs-sc-claim.
10.
To create a new notebook, complete the following steps:
aws-hybrid-nb.
c9e4w0g3/notebook-servers/jupyter-pytorch:1.11.0-cpu-py38-ubuntu20.04-e3-v1.1 (the latest available jupyter-pytorch DLC image).
1.
5.
efs-sc-claim.
/home/jovyan/efs-sc-claim.
efs-sc-claim in your Jupyter lab interface. You save the training dataset and training code to this folder so the training clusters can access it without needing to rebuild the container images for testing.
https://github.com/aws-samples/aws-do-kubeflow.
After you set up the Jupyter notebook, you can run the entire demo using the following high-level steps from the folder aws-do-kubeflow/workshop in the cloned repository:
Run 0_initialize_dependencies.ipynb to initialize all dependencies. (Refer to 3.2 for details.)
Run 1_submit_pytorchdist_k8s.ipynb to create and submit distributed training on one primary and two worker containers using the Kubernetes custom resource PyTorchJob YAML file using Python code. (Refer to 3.3 for details.)
Run 2_create_pipeline_k8s_sagemaker.ipynb to create the hybrid Kubeflow pipeline that runs distributed training on either SageMaker or Amazon EKS using the runtime variable training_runtime. (Refer to 3.4 for details.)
Make sure you have run the notebook 1_submit_pytorchdist_k8s.ipynb before you start the notebook 2_create_pipeline_k8s_sagemaker.ipynb.
In the subsequent sections, we discuss each of these steps in detail.
As part of the distributed training, we train a classification model created by a simple convolutional neural network that operates on the CIFAR10 dataset. The training script cifar10-distributed-gpu-final.py contains only the open-source libraries and is compatible to run both on Kubernetes and SageMaker training clusters on either GPU devices or CPU instances. Let’s look at a few important aspects of the training script before we run our notebook examples.
We use the torch.distributed module, which contains PyTorch support and communication primitives for multi-process parallelism across nodes in the cluster:
We create a simple image classification model using a combination of convolutional, max pooling, and linear layers to which a relu activation function is applied in the forward pass of the model training:
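The size of the first linear layer depends on how the convolution and pooling layers shrink the 32x32 CIFAR10 images. The following is a minimal sketch of that shape arithmetic. The layer hyperparameters (two 5x5 convolutions without padding, 2x2 max pooling, 16 output channels in the second convolution) are assumptions taken from the common PyTorch CIFAR10 tutorial network, not values confirmed by the training script in this post.

```python
def conv2d_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling window (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size: int = 32, out_channels: int = 16) -> int:
    """Number of inputs the first linear layer needs (assumed layer sizes)."""
    size = conv2d_out(image_size, kernel=5)       # conv1: 32 -> 28
    size = conv2d_out(size, kernel=2, stride=2)   # max pool: 28 -> 14
    size = conv2d_out(size, kernel=5)             # conv2: 14 -> 10
    size = conv2d_out(size, kernel=2, stride=2)   # max pool: 10 -> 5
    return out_channels * size * size             # channels * height * width


features = flattened_features()  # 16 * 5 * 5 under the assumptions above
```

If the script uses different kernel sizes or padding, the same helper can be reused with those values to check the linear layer input size.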
We use the torch DataLoader that combines the dataset and DistributedSampler (loads a subset of data in a distributed manner using torch.nn.parallel.DistributedDataParallel) and provides a single-process or multi-process iterator over the data:
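To make the idea of loading a subset of data per process concrete, the following pure-Python sketch shows one way a dataset can be sharded across processes: pad the index list so it divides evenly, then give each rank every world_size-th index. This is a simplified illustration written for this explanation (the function name shard_indices is hypothetical), not the PyTorch DistributedSampler implementation, and it omits shuffling.

```python
import math


def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Return the sample indices one process would handle (no shuffling)."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    # Wrap around so every rank receives the same number of samples.
    padded = [i % num_samples for i in range(total)]
    return padded[rank:total:world_size]


# Ten samples over three processes: each process gets four indices,
# and two indices are repeated to pad the last positions.
shards = [shard_indices(10, world_size=3, rank=r) for r in range(3)]
```

With the 50,000 CIFAR10 training images and three processes, each process handles 16,667 samples per epoch under this scheme.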
If the training cluster has GPUs, the script runs the training on CUDA devices and the device variable holds the default CUDA device:
Before you use PyTorch DistributedDataParallel to run distributed training on multiple nodes, you need to initialize the distributed environment by calling init_process_group. This initialization runs on each machine of the training cluster.
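The following is a minimal sketch of this initialization, assuming the launcher exposes the rendezvous settings through the WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT environment variables. How the training script in this post obtains these values on each platform is not shown here, so the helper names read_dist_env and init_distributed and the default values are hypothetical.

```python
import os


def read_dist_env(env=None) -> dict:
    """Collect rendezvous settings from environment variables (assumed names)."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": env.get("MASTER_PORT", "29500"),
        "distributed": world_size > 1,
    }


def init_distributed(cfg: dict, backend: str = "gloo") -> bool:
    """Call init_process_group only when more than one process takes part."""
    if not cfg["distributed"]:
        return False
    # Imported lazily so read_dist_env can be used without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,  # "nccl" is the usual choice on GPU nodes
        init_method=f"tcp://{cfg['master_addr']}:{cfg['master_port']}",
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )
    return True


cfg = read_dist_env({"WORLD_SIZE": "3", "RANK": "2",
                     "MASTER_ADDR": "example-master-0", "MASTER_PORT": "23456"})
```

Each process calls init_distributed once at startup; with a world size of 1 the call is skipped, so the same script can also run as a single process.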
We instantiate the classifier model and copy the model over to the target device. If distributed training is enabled to run on multiple nodes, the DistributedDataParallel class is used as a wrapper object around the model object, which allows synchronous distributed training across multiple machines. The input data is split on the batch dimension, and a replica of the model is placed on each machine and each device.
You install all the libraries needed to run the PyTorch distributed training example. This includes the Kubeflow Pipelines SDK, the Training Operator Python SDK, the Python client for Kubernetes, and the Amazon SageMaker Python SDK.
The notebook 1_submit_pytorchdist_k8s.ipynb creates the Kubernetes custom resource PyTorchJob YAML file using Kubeflow training and the Kubernetes client Python SDK. The following are a few important snippets from this notebook.
We create the PyTorchJob YAML with the primary and worker containers as shown in the following code:
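As an illustration of the structure of such a resource, the following sketch builds a PyTorchJob manifest as a plain Python dictionary with one primary (Master) replica and two Worker replicas. It is not the notebook's code: the job name, namespace, image placeholder, and script path are hypothetical, and the notebook itself uses the Training Operator and Kubernetes Python SDKs rather than raw dictionaries.

```python
def replica_spec(replicas: int, image: str, command: list) -> dict:
    """One replica group of a PyTorchJob (the container is named 'pytorch')."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {"name": "pytorch", "image": image, "command": command}
                ]
            }
        },
    }


def build_pytorchjob(name: str, namespace: str, image: str,
                     command: list, workers: int = 2) -> dict:
    """Manifest with one primary replica and `workers` worker replicas."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1, image, command),
                "Worker": replica_spec(workers, image, command),
            }
        },
    }


# Hypothetical values for illustration only.
job = build_pytorchjob(
    name="pytorch-dist-cifar10",
    namespace="kubeflow-user-example-com",
    image="<your-training-image>",
    command=["python", "cifar10-distributed-gpu-final.py"],
)
replica_specs = job["spec"]["pytorchReplicaSpecs"]
world_size = sum(spec["replicas"] for spec in replica_specs.values())
```

The total number of replicas across both groups is the world size of the distributed job, which is 3 for one primary and two workers.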
This is submitted to the Kubernetes control plane using PyTorchJobClient:
You can view the training logs either from the same Jupyter notebook using Python code or from the Kubernetes client shell.
log_type parameter value to view the primary, worker, or all logs:
We set the world size to 3 because we’re distributing the training to three processes running in one primary and two worker pods. Data is split at the batch dimension and a third of the data is processed by the model in each container.
The notebook 2_create_pipeline_k8s_sagemaker.ipynb creates a hybrid Kubeflow pipeline based on the conditional runtime variable training_runtime, as shown in the following code. The notebook uses the Kubeflow Pipelines SDK, which provides a set of Python packages to specify and run the ML workflow pipelines. As part of this SDK, we use the following packages:
dsl.pipeline, which decorates the Python functions to return a pipeline
dsl.Condition, which represents a group of operations that are only run when a certain condition is met, such as checking whether the training_runtime value is sagemaker or kubernetes
See the following code:
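The conditional structure can be sketched as follows, assuming the Kubeflow Pipelines SDK v1 (kfp 1.8.x) is installed. This is a simplified stand-in rather than the notebook's pipeline: the step names in RUNTIME_STEPS are hypothetical, and each branch only runs a placeholder echo container where the real pipeline would call the SageMaker training component or the PyTorchJob launcher component.

```python
# Hypothetical step names keyed by the supported training_runtime values.
RUNTIME_STEPS = {
    "sagemaker": "sagemaker-training-job",
    "kubernetes": "pytorchjob-launcher",
}


def build_pipeline():
    """Return a pipeline function with one conditional branch per runtime."""
    # Imported lazily so this module can be loaded without kfp installed.
    from kfp import dsl  # assumes KFP SDK v1

    @dsl.pipeline(
        name="hybrid-training-sketch",
        description="Chooses the training backend from a runtime parameter.",
    )
    def hybrid_pipeline(training_runtime: str = "sagemaker"):
        for runtime, step_name in RUNTIME_STEPS.items():
            # Only the branch whose condition holds at run time is executed.
            with dsl.Condition(training_runtime == runtime, name=f"if-{runtime}"):
                dsl.ContainerOp(
                    name=step_name,
                    image="alpine:3.16",  # placeholder image
                    command=["echo", f"training on {runtime}"],
                )

    return hybrid_pipeline
```

Because training_runtime is a pipeline parameter, the comparison is evaluated when the pipeline runs, so the same compiled pipeline can be started with either value.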
We configure SageMaker distributed training using two ml.p3.2xlarge instances.
After the pipeline is defined, you can compile the pipeline to an Argo YAML specification using the Kubeflow Pipelines SDK’s kfp.compiler package. You can run this pipeline using the Kubeflow Pipelines SDK client, which calls the Pipelines service endpoint and passes in appropriate authentication headers right from the notebook. See the following code:
If you get a sagemaker import error, run !pip install sagemaker and restart the kernel (on the Kernel menu, choose Restart Kernel).
Choose the Run details link under the last cell to view the Kubeflow pipeline.
Repeat the pipeline creation step with training_runtime='kubernetes' to test the pipeline run on a Kubernetes environment. The training_runtime variable can also be passed in your CI/CD pipeline in a production scenario.
The following screenshot shows our pipeline details for the SageMaker component.
Choose the training job step and on the Logs tab, choose the CloudWatch logs link to access the SageMaker logs.
The following screenshot shows the CloudWatch logs for each of the two ml.p3.2xlarge instances.
Choose any of the groups to see the logs.
The following screenshot shows the pipeline details for our Kubeflow component.
Run the following commands using kubectl on your Kubernetes client shell connected to the Kubernetes cluster to see the logs (substitute your namespace and pod names):
To clean up all the resources we created in the account, we need to remove them in reverse order.
Run ./kubeflow-remove.sh in the aws-do-kubeflow container. The first set of commands is optional and can be used in case you don’t already have a command shell into your aws-do-kubeflow container open.
From the aws-do-eks container folder, remove the EFS volume. The first set of commands is optional and can be used in case you don’t already have a command shell into your aws-do-eks container open. Deleting Amazon EFS is necessary in order to release the network interface associated with the VPC we created for our cluster. Note that deleting the EFS volume destroys any data that is stored on it.
From the aws-do-eks container, run the eks-delete.sh script to delete the cluster and any other resources associated with it, including the VPC:
In this post, we discussed some of the typical challenges of distributed model training and ML workflows. We provided an overview of the Kubeflow on AWS distribution and shared two open-source projects (aws-do-eks and aws-do-kubeflow) that simplify provisioning the infrastructure and the deployment of Kubeflow on it. Finally, we described and demonstrated a hybrid architecture that enables workloads to transition seamlessly between running on a self-managed Kubernetes and fully managed SageMaker infrastructure. We encourage you to use this hybrid architecture for your own use cases.
You can follow the AWS Labs repository to track all AWS contributions to Kubeflow. You can also find us on the Kubeflow #AWS Slack Channel; your feedback there will help us prioritize the next features to contribute to the Kubeflow project.
Special thanks to Sree Arasanagatta (Software Development Manager AWS ML) and Suraj Kota (Software Dev Engineer) for their support to the launch of this post.