Monitoring machine learning (ML) predictions can help improve the quality of deployed models. Capturing the data from inferences made in production can enable you to monitor your deployed models and detect deviations in model quality. Early and proactive detection of these deviations enables you to take corrective actions, such as retraining models, auditing upstream systems, or fixing quality issues.
AWS Lambda is a serverless compute service that can provide real-time ML inference at scale. In this post, we demonstrate a sample data capture feature that can be deployed to a Lambda ML inference workload.
In December 2020, Lambda introduced support for container images as a packaging format. This feature increased the deployment package size limit from 250 MB (for .zip archives) to 10 GB. Prior to this feature launch, the package size constraint made it difficult to deploy ML frameworks like TensorFlow or PyTorch to Lambda functions. After the launch, the increased package size limit made ML a viable and attractive workload to deploy to Lambda. In 2021, ML inference was one of the fastest growing workload types in the Lambda service.
Amazon SageMaker, Amazon’s fully managed ML service, contains its own model monitoring feature. However, the sample project in this post shows how to perform data capture for use in model monitoring for customers who use Lambda for ML inference. The project uses Lambda extensions to capture inference data in order to minimize the impact on the performance and latency of the inference function. Using Lambda extensions also minimizes the impact on function developers. By integrating via an extension, the monitoring feature can be applied to multiple functions and maintained by a centralized team.
This project contains source code and supporting files for a serverless application that provides real-time inference using a distilbert-base pretrained question answering model. The project uses the Hugging Face question and answer natural language processing (NLP) model with PyTorch to perform natural language inference tasks. The project also contains a solution to perform inference data capture for the model predictions. The Lambda function writer can determine exactly which data from the inference request input and the prediction result to send to the extension. In this solution, we send the input and the answer from the model to the extension. The extension then periodically sends the data to an Amazon Simple Storage Service (Amazon S3) bucket. We build the data capture extension as a container image using a makefile. We then build the Lambda inference function as a container image and add the extension container image as a container image layer. The following diagram shows an overview of the architecture.
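To make the data flow concrete, the following is a minimal Python sketch of what the inference handler could look like. It assumes a Hugging Face question answering pipeline (the distilbert-base-uncased-distilled-squad checkpoint is an assumption) and a hypothetical local extension listener on port 8777; the sample project's actual handler and its mechanism for handing data to the extension may differ.

```python
# Minimal sketch of an inference handler that forwards the input and the
# model answer to a monitoring extension. Port 8777 and the payload shape
# are assumptions for illustration, not the sample project's exact code.
import json
import urllib.request

from transformers import pipeline

# Load the question answering model once per execution environment.
# The specific checkpoint name is an assumption.
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")


def handler(event, context):
    body = json.loads(event["body"])
    result = qa(question=body["question"], context=body["context"])

    # Send only the chosen fields (the input and the answer) to the extension.
    record = json.dumps({
        "question": body["question"],
        "context": body["context"],
        "answer": result["answer"],
    }).encode()
    try:
        urllib.request.urlopen(
            urllib.request.Request(
                "http://127.0.0.1:8777",
                data=record,
                headers={"Content-Type": "application/json"},
            ),
            timeout=1,
        )
    except Exception:
        pass  # Monitoring must never break inference.

    return {"statusCode": 200, "body": json.dumps(result)}
```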
Lambda extensions are a way to augment Lambda functions. In this project, we use an external Lambda extension to log the inference request and the prediction from the inference. The external extension runs as a separate process in the Lambda runtime environment, diminishing the impact on the inference function. However, the extension shares resources such as CPU, memory, and storage with the Lambda function. We recommend allocating enough memory to the Lambda function to ensure optimal resource availability. (In our testing, we allocated 5 GB of memory to the inference Lambda function and saw optimal resource availability and inference latency.) When an inference is complete, the Lambda service returns the response immediately and doesn't wait for the extension to finish logging the request and response to the S3 bucket. With this pattern, the monitoring extension doesn't affect the inference latency. To learn more about Lambda extensions, check out this video series.
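As a rough illustration of the moving parts, here is a minimal Python sketch of an external extension. It assumes the extension registers for INVOKE and SHUTDOWN events with the Lambda Extensions API, receives records from the function over a hypothetical local HTTP listener on port 8777, and flushes them to S3. The MODEL_MONITOR_BUCKET environment variable and the flush-per-invoke behavior are assumptions, not the sample project's exact implementation.

```python
# Simplified sketch of an external extension: register with the Lambda
# Extensions API, buffer records received from the function over a local
# HTTP listener (port 8777 is an assumption), and flush them to S3.
import json
import os
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

import boto3

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BUCKET = os.environ.get("MODEL_MONITOR_BUCKET", "my-model-monitor-bucket")  # assumed env var
s3 = boto3.client("s3")
buffer, lock = [], threading.Lock()


def register_extension():
    # Register for INVOKE and SHUTDOWN events with the Extensions API.
    req = urllib.request.Request(
        f"http://{RUNTIME_API}/2020-01-01/extension/register",
        data=json.dumps({"events": ["INVOKE", "SHUTDOWN"]}).encode(),
        headers={"Lambda-Extension-Name": "model-monitor-extension"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Lambda-Extension-Identifier"]


class CaptureHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        record = self.rfile.read(int(self.headers["Content-Length"]))
        with lock:
            buffer.append(record.decode())
        self.send_response(200)
        self.end_headers()


def flush():
    # Write any buffered records to S3 as a JSON Lines object.
    with lock:
        records, buffer[:] = list(buffer), []
    if records:
        key = f"inference-data/{int(time.time() * 1000)}.jsonl"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(records).encode())


def main():
    ext_id = register_extension()
    threading.Thread(
        target=HTTPServer(("127.0.0.1", 8777), CaptureHandler).serve_forever,
        daemon=True,
    ).start()
    while True:
        # Block until the next lifecycle event, then flush asynchronously
        # from the function's response path.
        req = urllib.request.Request(
            f"http://{RUNTIME_API}/2020-01-01/extension/event/next",
            headers={"Lambda-Extension-Identifier": ext_id},
        )
        with urllib.request.urlopen(req) as resp:
            event = json.loads(resp.read())
        flush()
        if event.get("eventType") == "SHUTDOWN":
            break


if __name__ == "__main__":
    main()
```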
This project uses the AWS Serverless Application Model (AWS SAM) command line interface (CLI). This command-line tool allows developers to initialize and configure applications; package, build, and test locally; and deploy to the AWS Cloud.
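The AWS SAM CLI commands below map to those capabilities and are shown only for orientation; the exact sequence for this project, including the extra build and Amazon ECR steps, follows in the walkthrough.

```bash
sam init                  # initialize and configure an application
sam build                 # package and build locally
sam local invoke          # test locally
sam deploy --guided       # deploy to the AWS Cloud
```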
You can download the source code for this project from the GitHub repository.
This project includes the following files and folders:
The application uses several AWS resources, including Lambda functions and an Amazon API Gateway API. These resources are defined in the template.yaml file in this project. You can update the template to add AWS resources through the same deployment process that updates your application code.
For this walkthrough, you should have the following prerequisites:
To build your application for the first time, complete the following steps:
1. Build the application with the AWS SAM CLI. The build artifacts are stored in the .aws-sam directory.
2. Create an Amazon ECR repository for the model monitoring extension image:

aws ecr create-repository \
    --repository-name serverless-ml-model-monitor \
    --image-scanning-configuration scanOnPush=true \
    --region us-east-1
We build again because Lambda doesn’t support Lambda layers directly for the container image packaging type. We need to first build the model monitoring component as a container image, upload it to Amazon ECR, and then use that image in the model monitoring application as a container layer.
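For example, pushing the extension image to the repository created above might look like the following (the account ID, Region, and image tag are placeholders; adjust them to your environment).

```bash
# Authenticate Docker to Amazon ECR, then tag and push the extension image.
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag serverless-ml-model-monitor:latest \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/serverless-ml-model-monitor:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/serverless-ml-model-monitor:latest
```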
This command packages and deploys your application to AWS with a series of prompts:
Confirm changes before deploy: If set to yes, any change sets are shown to you before running for manual review. If set to no, the AWS SAM CLI automatically deploys application changes.
Allow SAM CLI IAM role creation: Because the application creates or modifies IAM roles, the CAPABILITY_IAM value for capabilities must be provided. If permission isn't provided through this prompt, to deploy this example you must explicitly pass --capabilities CAPABILITY_IAM to the sam deploy command.
Save arguments to samconfig.toml: If set to yes, your choices are saved to a configuration file inside the project so that in the future, you can just run sam deploy without parameters to deploy changes to your application.

You can find your API Gateway endpoint URL in the output values displayed after deployment.
To test the application, use Postman or curl to send a request to the API Gateway endpoint. For example:
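The following is an illustrative request; the endpoint URL comes from the deployment output, and the resource path and payload field names shown here are assumptions rather than the project's exact API contract.

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the capital of France?", "context": "Paris is the capital and largest city of France."}' \
  https://<restapiid>.execute-api.us-east-1.amazonaws.com/Prod/nlp-qa
```

For this example, the answer field in the response should be Paris; the Hugging Face question answering pipeline also returns a confidence score and the start and end character offsets of the answer span.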
You should see output like the following code. The ML model inferred from the context and returned the answer for our question.
After a few minutes, you should see a file in the S3 bucket nlp-qamodel-model-monitoring-modelmonitorbucket-<xxxxxx> with the input and the inference logged.
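To confirm that records are arriving, you can list the bucket contents with the AWS CLI (the bucket suffix is unique to your deployment):

```bash
aws s3 ls s3://nlp-qamodel-model-monitoring-modelmonitorbucket-<xxxxxx>/ --recursive
```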
To delete the sample application that you created, use the AWS CLI:
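A minimal example, assuming the stack name you chose during sam deploy (placeholder shown):

```bash
# If the monitoring bucket still contains captured data, empty it first so
# CloudFormation can delete it along with the stack.
aws cloudformation delete-stack --stack-name <your-stack-name>
```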
In this post, we implemented a model monitoring feature as a Lambda extension and deployed it to a Lambda ML inference workload. We showed how to build and deploy this solution to your own AWS account. Finally, we showed how to run a test to verify the functionality of the monitor.
Please provide any thoughts or questions in the comments section. For more serverless learning resources, visit Serverless Land.