Amazon SageMaker Pipelines is a continuous integration and continuous delivery (CI/CD) service designed for machine learning (ML) use cases. You can use it to create, automate, and manage end-to-end ML workflows. It tackles the challenge of orchestrating each step of an ML process, which requires time, effort, and resources. To facilitate its use, multiple templates are available that you can customize according to your needs.
Fully managed image and video analysis services have also accelerated the adoption of computer vision solutions. AWS offers Amazon Rekognition, a pre-trained and fully managed AI service that can be integrated into computer vision applications through API calls and requires no ML experience. You simply provide an image to the Amazon Rekognition API and it identifies the required objects according to predefined labels. You can also provide custom labels specific to your use case and build a customized computer vision model with little to no ML expertise required.
In this post, we address a specific computer vision problem, skin lesion classification, and use Pipelines by customizing an existing template and tailoring it to this task. Accurate skin lesion classification can help with early diagnosis of skin cancer. However, it’s a challenging task in the medical field, because different kinds of skin lesions can look very similar. Pipelines allows us to take advantage of a variety of existing models and algorithms, and establish an end-to-end productionized pipeline with minimal effort and time.
In this post, we build an end-to-end pipeline using Pipelines to classify dermatoscopic images of common pigmented skin lesions. We use the Amazon SageMaker Studio project template called MLOps template for building, training, and deploying models, together with the code in the following GitHub repository. The resulting architecture is shown in the following figure.
For this pipeline, we use the HAM10000 (“Human Against Machine with 10000 training images”) dataset, which consists of 10,015 dermatoscopic images. The task at hand is a multi-class classification problem in the field of computer vision. This dataset covers the seven most important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma or Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines, seborrheic keratoses, and lichen-planus like keratoses; bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas, and hemorrhage; vasc).
For the model’s input, we use the RecordIO format. This is a compact format that stores image data together for continuous reading, which makes training faster and more efficient. In addition, one of the challenges of using the HAM10000 dataset is class imbalance. The following table illustrates the class distribution.
Class | akiec | bcc | bkl | df | mel | nv | vasc | Total |
Number of images | 327 | 514 | 1099 | 115 | 1113 | 6705 | 142 | 10015 |
To address this issue, we augment the dataset using random transformations (such as cropping, flipping, mirroring, and rotating) so that all classes end up with approximately the same number of images.
This preprocessing step uses MXNet and OpenCV, so it uses a pre-built MXNet container image. The remaining dependencies are installed using a requirements.txt file. If you want to create and use a custom image, refer to Create Amazon SageMaker projects with image building CI/CD pipelines.
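As an illustration of the random augmentation described above, the following sketch shows how OpenCV and NumPy can generate transformed copies of an image. The function name and parameter choices here are ours, not the exact ones from the repository’s preprocessing script:

```python
import random
import cv2
import numpy as np

def random_augment(img: np.ndarray) -> np.ndarray:
    """Apply a random flip, rotation, and crop to one image (illustrative only)."""
    # Random horizontal or vertical flip (mirroring)
    if random.random() < 0.5:
        img = cv2.flip(img, random.choice([0, 1]))
    # Random rotation around the image center
    h, w = img.shape[:2]
    angle = random.uniform(-30, 30)
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, rotation, (w, h), borderMode=cv2.BORDER_REFLECT)
    # Random crop to 90% of the original size, then resize back
    ch, cw = int(h * 0.9), int(w * 0.9)
    y = random.randint(0, h - ch)
    x = random.randint(0, w - cw)
    return cv2.resize(img[y:y + ch, x:x + cw], (w, h))
```

In the actual pipeline, minority classes receive more augmented copies than majority classes so that the resulting class counts are roughly balanced.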
For the training step, we use the estimator backed by the SageMaker built-in Docker image for image classification and set its parameters as follows:
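The exact parameter values are defined in the pipeline code in the GitHub repository; the following is only a minimal sketch, with illustrative values, of how the built-in image classification estimator can be configured with the SageMaker Python SDK. The role ARN, instance type, and hyperparameter values are placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder execution role

# Retrieve the URI of the SageMaker built-in image classification container
image_uri = image_uris.retrieve("image-classification", region)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # the built-in algorithm trains on GPU instances
    sagemaker_session=session,
)

# Illustrative hyperparameters for a 7-class problem; the repository defines the actual values
estimator.set_hyperparameters(
    num_classes=7,
    num_training_samples=30000,  # depends on how strongly the data was augmented
    image_shape="3,224,224",
    mini_batch_size=32,
    epochs=20,
    learning_rate=0.001,
    use_pretrained_model=1,
)
```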
For further details about the container image, refer to Image Classification Algorithm.
For detailed instructions on how to set up Studio, refer to Onboard to Amazon SageMaker Domain Using Quick setup. To create your project, complete the following steps:
The project takes a few minutes to be created.
To prepare the dataset, complete the following steps:

1. Download the HAM10000 dataset as the dataverse_files.zip archive.
2. Create an S3 bucket with sagemaker in its name (this allows SageMaker to access the bucket without any extra permissions).
3. Upload dataverse_files.zip to the bucket.

Because we’re using MXNet and OpenCV in our preprocessing step, we use a pre-built MXNet Docker image and install the remaining dependencies using the requirements.txt file. To do so, you need to copy it and paste it under pipelines/skin in the sagemaker-<pipeline-name>-modelbuild repository. In addition, add the MANIFEST.in file at the same level as setup.py to tell Python to include the requirements.txt file. For more information about MANIFEST.in, refer to Including files in source distributions with MANIFEST.in. Both files can be found in the GitHub repository.
To update the Pipelines template, complete the following steps:

1. Go to the pipelines directory inside the sagemaker-<pipeline-name>-modelbuild directory and rename the abalone directory to skin.
2. Open the codebuild-buildspec.yml file in the sagemaker-<pipeline-name>-modelbuild directory and modify the run pipeline path (line 15) from run-pipeline --module-name pipelines.abalone.pipeline to run-pipeline --module-name pipelines.skin.pipeline.
3. Replace pipelines.py, preprocess.py, and evaluate.py in the pipelines directory with the files from the GitHub repository.
4. Update the preprocess.py file (lines 183-186) with the S3 location (SKIN_CANCER_BUCKET) and folder name (SKIN_CANCER_BUCKET_PATH) where you uploaded the dataverse_files.zip archive:

skin_cancer_bucket="<bucket-name-containing-dataset>"
skin_cancer_bucket_path="<prefix-to-dataset-inside-bucket>"
skin_cancer_files="<dataset-file-name-without-extension>"
skin_cancer_files_ext="<dataset-file-name-with-extension>"

For example, if the bucket is named monai-bucket-skin-cancer and the prefix is skin_cancer_bucket_prefix, the dataset would be stored under s3://monai-bucket-skin-cancer/skin_cancer_bucket_prefix/dataverse_files.zip.
Pushing committed changes to the CodeCommit repository (done on the Studio source control tab) triggers a new pipeline run, because an Amazon EventBridge rule monitors the repository for commits. We can monitor the run by choosing the pipeline inside the SageMaker project. The following screenshot shows an example of a pipeline that ran successfully.
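Commits are the usual trigger, but you can also start a run of the same pipeline manually, for example with boto3. The pipeline name below is a placeholder; use the name created by your project:

```python
import boto3

sm = boto3.client("sagemaker")

# Start a new execution of the model-build pipeline (the pipeline name is a placeholder)
response = sm.start_pipeline_execution(
    PipelineName="sagemaker-skin-classification-modelbuild-pipeline",
    PipelineExecutionDisplayName="manual-run",
)
print(response["PipelineExecutionArn"])

# Check the status of recent executions
summaries = sm.list_pipeline_executions(
    PipelineName="sagemaker-skin-classification-modelbuild-pipeline"
)["PipelineExecutionSummaries"]
for execution in summaries:
    print(execution["PipelineExecutionStatus"], execution["PipelineExecutionArn"])
```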
Working with notebooks in Studio creates a -checkpoint file for each notebook. You can add an entry to the .gitignore file with *checkpoint.* to ignore them. In addition, all the steps defined in codebuild-buildspec.yml have to run successfully. To check on the status of these steps, you can use AWS CodeBuild. For more information, refer to AWS CodeBuild (AMS SSPS).

After training and evaluation, the model’s ModelApprovalStatus is set to PendingManualApproval. If our model has greater than 60% accuracy, it’s added to the model registry, but not deployed until manual approval is complete. After you approve the model, the staging endpoint is deployed; wait until it shows an InService status. Then, on the CodePipeline console, choose the sagemaker-<pipeline-name>-modeldeploy pipeline that is currently in progress. In the DeployStaging phase, you need to manually approve the deployment. After this step, you can see the production endpoint being deployed on the SageMaker Endpoints page. After a while, the endpoint shows as InService.
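Once the production endpoint is InService, you can test it by sending it a dermatoscopic image. The following is a minimal sketch using the SageMaker runtime API; the endpoint name and image path are placeholders, and the class order assumes the label mapping used during preprocessing:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder test image
with open("test_lesion.jpg", "rb") as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="skin-classification-prod",  # placeholder endpoint name
    ContentType="application/x-image",
    Body=payload,
)

# The built-in image classification algorithm returns one probability per class
probabilities = json.loads(response["Body"].read())
# Class order assumes labels 0-6 were assigned in this order during preprocessing
classes = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]
prediction = classes[probabilities.index(max(probabilities))]
print(prediction, max(probabilities))
```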
You can easily clean up all the resources created by the SageMaker project.
We successfully used Pipelines to create an end-to-end MLOps framework for skin lesion classification using a built-in model on the HAM10000 dataset. For the parameters provided in the repository, we obtained the following results on the test set.
Metric | Precision | Recall | F1 score |
Value | 0.643 | 0.8 | 0.713 |
You can work further on improving the performance of the model by fine-tuning its hyperparameters, adding more transformations for data augmentation, or using other methods, such as Synthetic Minority Oversampling Technique (SMOTE) or Generative Adversarial Networks (GANs). Furthermore, you can use your own model or algorithm for training by using built-in SageMaker Docker images or adapting your own container to work on SageMaker. For further details, refer to Using Docker containers with SageMaker.
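As an example of the first suggestion, hyperparameter fine-tuning could be run with SageMaker automatic model tuning. The sketch below reuses the estimator configured for the training step; the S3 locations, objective metric, and parameter ranges are illustrative assumptions:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# `estimator` is the built-in image classification estimator shown earlier;
# the S3 URIs below are placeholders for the preprocessed RecordIO data.
train_input = TrainingInput("s3://monai-bucket-skin-cancer/train/train.rec",
                            content_type="application/x-recordio")
validation_input = TrainingInput("s3://monai-bucket-skin-cancer/validation/val.rec",
                                 content_type="application/x-recordio")

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",  # built-in metric of the algorithm
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.0001, 0.1),
        "mini_batch_size": IntegerParameter(16, 64),
    },
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": train_input, "validation": validation_input})
```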
You can also add features to your pipeline. If you want to include monitoring, choose the MLOps template for model building, training, deployment and monitoring when creating the SageMaker project. The resulting architecture has an additional monitoring step. Alternatively, if you have an existing third-party Git repository, you can use it by choosing the MLOps template for model building, training, and deployment with third-party Git repositories using Jenkins and providing information for both the model building and model deployment repositories. This lets you reuse existing code and saves you time and effort on integration between SageMaker and Git. However, for this option, an AWS CodeStar connection is required.
In this post, we showed how to create an end-to-end ML workflow using Studio and Pipelines. The workflow includes getting the dataset, storing it in a place accessible to the ML model, configuring a container image for preprocessing, and modifying the boilerplate code to accommodate that image. Then we showed how to trigger the pipeline, the steps that the pipeline follows, and how they work. We also discussed how to monitor model performance and deploy the model to an endpoint.
We performed most of these tasks within Studio, which acts as an all-encompassing ML IDE, and accelerates the development and deployment of such models.
This solution is not bound to the skin lesion classification task. You can extend it to any classification or regression task using any of the SageMaker built-in algorithms or pre-trained models.