This post is co-written with Jayadeep Pabbisetty, Sr. Specialist Data Engineering at Merck, and Prabakaran Mathaiyan, Sr. ML Engineer at Tiger Analytics.
The large machine learning (ML) model development lifecycle requires a scalable model release process similar to that of software development. Model developers often work together in developing ML models and require a robust MLOps platform to work in. A scalable MLOps platform needs to include a process for handling the workflow of ML model registry, approval, and promotion to the next environment level (development, test, UAT, or production).
A model developer typically starts to work in an individual ML development environment within Amazon SageMaker. When a model is trained and ready to be used, it needs to be approved after being registered in the Amazon SageMaker Model Registry. In this post, we discuss how the AWS AI/ML team collaborated with the Merck Human Health IT MLOps team to build a solution that uses an automated workflow for ML model approval and promotion with human intervention in the middle.
This post focuses on a workflow solution that the ML model development lifecycle can use between the training pipeline and inferencing pipeline. The solution provides a scalable workflow for MLOps in supporting the ML model approval and promotion process with human intervention. An ML model registered by a data scientist needs an approver to review and approve before it is used for an inference pipeline and in the next environment level (test, UAT, or production). The solution uses AWS Lambda, Amazon API Gateway, Amazon EventBridge, and SageMaker to automate the workflow with human approval intervention in the middle. The following architecture diagram shows the overall system design, the AWS services used, and the workflow for approving and promoting ML models with human intervention from development to production.
The workflow includes the following steps:
PendingManualApproval
.Approved
for the dev environment, but PendingManualApproval
for test, UAT, and production).The workflow in this post assumes the environment for the training pipeline is set up in SageMaker, along with other resources. The input to the training pipeline is the features dataset. The feature generation details are not included in this post, but it focuses on the registry, approval, and promotion of ML models after they are trained. The model is registered in the model registry and is governed by a monitoring framework in Amazon SageMaker Model Monitor to detect for any drift and proceed to retraining in case of model drift.
The approval workflow starts with a model developed from a training pipeline. When data scientists develop a model, they register it to the SageMaker Model Registry with the model status of PendingManualApproval
. EventBridge monitors SageMaker for the model registration event and triggers an event rule that invokes a Lambda function. The Lambda function dynamically constructs an email for an approval of the model with a link to an API Gateway endpoint to another Lambda function. When the approver follows the link to approve the model, API Gateway forwards the approval action to the Lambda function, which updates the SageMaker Model Registry and the model attributes in Parameter Store. The approver must be authenticated and part of the approver group managed by Active Directory. The initial approval marks the model as Approved
for dev but PendingManualApproval
for test, UAT, and production. The model attributes saved in Parameter Store include the model version, model package, and approved target environment.
When an inference pipeline needs to fetch a model, it checks Parameter Store for the latest model version approved for the target environment and gets the inference details. When the inference pipeline is complete, a post-inference notification email is sent to a stakeholder requesting an approval to promote the model to the next environment level. The email has the details about the model and metrics as well as an approval link to an API Gateway endpoint for a Lambda function that updates the model attributes.
The following is the sequence of events and implementation steps for the ML model approval/promotion workflow from model creation to production. The model is promoted from development to test, UAT, and production environments with an explicit human approval in each step.
We start with the training pipeline, which is ready for model development. The model version starts as 0 in SageMaker Model Registry.
dev, test, uat
, and prod
.The complete history of the model versioning and approval is saved for review in Parameter Store.
The large ML model development lifecycle requires a scalable ML model approval process. In this post, we shared an implementation of an ML model registry, approval, and promotion workflow with human intervention using SageMaker Model Registry, EventBridge, API Gateway, and Lambda. If you are considering a scalable ML model development process for your MLOps platform, you can follow the steps in this post to implement a similar workflow.
Our next iteration of the FSF sets out stronger security protocols on the path to…
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this…
Generative AI has revolutionized technology through generating content and solving complex problems. To fully take…
At Google Cloud, we're deeply invested in making AI helpful to organizations everywhere — not…
Advanced Micro Devices reported revenue of $7.658 billion for the fourth quarter, up 24% from…