Kubernetes is a popular orchestration platform for managing containers. Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated and, if the quality of the obtained model is satisfactory, uploaded to a model registry.
Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources needed to process data, train models, and run evaluation tests.
A challenge for DevOps engineers is the additional complexity that comes from using Kubernetes to manage the deployment stage while resorting to other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building pipeline. One alternative to simplify this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of the Kubernetes cluster.
In this post, we introduce an example to help DevOps engineers manage the entire ML lifecycle—including training and inference—using the same toolkit.
We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a Directed Acyclic Graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which is running as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline runs in SageMaker. This entire workflow is shown in the following solution diagram.
To follow along, you should have the following prerequisites:
iam:CreateRole
, iam:AttachRolePolicy
, and iam:PutRolePolicy
) to allow creating roles and attaching policies to roles.The SageMaker ACK service controller makes it straightforward for DevOps engineers to use Kubernetes as their control plane to create and manage ML pipelines. To install the controller in your EKS cluster, complete the following steps:
The following tutorial provides step-by-step instructions with the required commands to install the ACK service controller for SageMaker.
In most companies, ML engineers are responsible for creating the ML pipeline in their organization. They often work with DevOps engineers to operate those pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes that are needed to fully define the pipeline. This definition then gets retrieved by the DevOps engineer for deploying and maintaining the infrastructure needed for the pipeline.
The following is a sample pipeline definition with one training step:
With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts these by default using the AWS managed key for Amazon S3. You can optionally specify a custom key using the KmsKeyId
property of the OutputDataConfig
argument. For more information on how SageMaker protects data, see Data Protection in Amazon SageMaker.
Furthermore, we recommend securing access to the pipeline artifacts, such as model outputs and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.
In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools such as kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.
Refer to the following Kubernetes YAML specification for a SageMaker pipeline. DevOps engineers need to modify the .spec.pipelineDefinition
key in the file and add the ML engineer-provided pipeline JSON definition. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to submit a pipeline YAML specification:
In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml
) as follows:
To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:
Refer to the following Kubernetes YAML specification for a SageMaker pipeline. Prepare the pipeline execution YAML specification (pipeline-execution.yaml
) as follows:
To start a run of the pipeline, use the following code:
To list all pipelines created using the ACK controller, use the following command:
To list all pipeline runs, use the following command:
To get more details about the pipeline after it’s submitted, like checking the status, errors, or parameters of the pipeline, use the following command:
To troubleshoot a pipeline run by reviewing more details about the run, use the following command:
Use the following command to delete any pipelines you created:
Use the following command to cancel any pipeline runs you started:
In this post, we presented an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can efficiently work with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This enables DevOps engineers to manage all the steps of the ML lifecycle with the same set of tools and environment they are used to, which enables organizations to innovate faster and more efficiently.
Explore the GitHub repository for ACK and the SageMaker controller to start managing your ML operations with Kubernetes.
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved…
Today we are announcing the general availability of Amazon Bedrock Prompt Management, with new features…
The rise of Natural Language Processing (NLP) combined with traditional Structured Query Language (SQL) has…
Through his wealth and cultural influence, Elon Musk undoubtedly strengthened the Trump campaign. WIRED unpacks…
The growing use of artificial intelligence (AI)-based models is placing greater demands on the electronics…
This post is co-written with Steven Craig from Hearst. To maintain their competitive edge, organizations…