Veriff is an identity verification platform partner for innovative growth-driven organizations, including pioneers in financial services, FinTech, crypto, gaming, mobility, and online marketplaces. They provide advanced technology that combines AI-powered automation with human feedback, deep insights, and expertise.
Veriff delivers a proven infrastructure that enables their customers to have trust in the identities and personal attributes of their users across all the relevant moments in their customer journey. Veriff is trusted by customers such as Bolt, Deel, Monese, Starship, Super Awesome, Trustpilot, and Wise.
As an AI-powered solution, Veriff needs to create and run dozens of machine learning (ML) models in a cost-effective way. These models range from lightweight tree-based models to deep learning computer vision models, which need to run on GPUs to achieve low latency and improve the user experience. Veriff is also currently adding more products to its offering, targeting a hyper-personalized solution for its customers. Serving different models for different customers adds to the need for a scalable model serving solution.
In this post, we show you how Veriff standardized their model deployment workflow using Amazon SageMaker, reducing costs and development time.
Veriff’s backend architecture is based on a microservices pattern, with services running on different Kubernetes clusters hosted on AWS infrastructure. This approach was initially used for all company services, including microservices that run expensive computer vision ML models.
Some of these models required deployment on GPU instances. Conscious of the comparatively higher cost of GPU-backed instance types, Veriff developed a custom solution on Kubernetes to share a given GPU’s resources between different service replicas. A single GPU typically has enough VRAM to hold multiple of Veriff’s computer vision models in memory.
Although the solution did alleviate GPU costs, it also came with the constraint that data scientists needed to indicate beforehand how much GPU memory their model would require. Furthermore, DevOps engineers were burdened with manually provisioning GPU instances in response to demand patterns. This caused operational overhead and overprovisioning of instances, which resulted in a suboptimal cost profile.
Apart from GPU provisioning, this setup also required data scientists to build a REST API wrapper for each model, which was needed to provide a generic interface for other company services to consume, and to encapsulate preprocessing and postprocessing of model data. These APIs required production-grade code, which made it challenging for data scientists to productionize models.
Veriff’s data science platform team looked for alternatives to this approach. The main objective was to support the company’s data scientists with a better transition from research to production by providing simpler deployment pipelines. The secondary objective was to reduce the operational costs of provisioning GPU instances.
Veriff required a new solution that solved two problems:
Ultimately, the ML platform team converged on the decision to use SageMaker multi-model endpoints (MMEs). This decision was driven by MME’s support for NVIDIA’s Triton Inference Server (an ML-focused server that makes it easy to wrap models as REST APIs; Veriff was also already experimenting with Triton), as well as its capability to natively manage the auto scaling of GPU instances via simple auto scaling policies.
Two MMEs were created at Veriff, one for staging and one for production. This approach allows them to run testing steps in a staging environment without affecting the production models.
SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint’s traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).
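To make the MME request path more concrete, the following is a minimal Python sketch of how a caller might address one specific model behind a shared endpoint. It is not Veriff's code. It assumes a boto3 `sagemaker-runtime` client, Triton's JSON inference request format, and the `application/octet-stream` content type commonly used with the SageMaker Triton container. The endpoint name, archive name, and input tensor name are hypothetical.

```python
import json


def build_invoke_request(endpoint_name, model_archive, input_name, shape, datatype, data):
    """Build keyword arguments for the sagemaker-runtime invoke_endpoint call.

    TargetModel selects which archive in the MME's S3 location should serve
    the request. The payload follows Triton's JSON inference request layout.
    """
    payload = {
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_archive,
        # Assumption: content type used with the SageMaker Triton container.
        "ContentType": "application/octet-stream",
        "Body": json.dumps(payload),
    }


def invoke(runtime_client, request):
    """Send the request and decode the JSON response body."""
    response = runtime_client.invoke_endpoint(**request)
    return json.loads(response["Body"].read())
```

With boto3 installed and credentials configured, `runtime_client` would be `boto3.client("sagemaker-runtime")`. Keeping request construction separate from the network call makes the payload easy to inspect and unit test.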
There were several reasons why Veriff decided to use Triton Inference Server, the main ones being:
Using Triton allows data scientists to deploy models with ease because they only need to build formatted model repositories instead of writing code to build REST APIs (Triton also supports Python models if custom inference logic is required). This decreases model deployment time and gives data scientists more time to focus on building models instead of deploying them.
Another important feature of Triton is that it allows you to build model ensembles, which are groups of models that are chained together. These ensembles can be run as if they were a single Triton model. Veriff currently employs this feature to deploy preprocessing and postprocessing logic with each ML model using Python models (as mentioned earlier), ensuring that there are no mismatches in the input data or model output when models are used in production.
The following is what a typical Triton model repository looks like for this workload:
The model.py file contains preprocessing and postprocessing code. The trained model weights are in the screen_detection_inferencer directory, under model version 1 (the model is in ONNX format in this example, but it can also be in TensorFlow, PyTorch, or other formats). The ensemble model definition is in the screen_detection_pipeline directory, where inputs and outputs between steps are mapped in a configuration file.
Additional dependencies needed to run the Python models are detailed in a requirements.txt file, and need to be conda-packed to build a Conda environment (python_env.tar.gz). For more information, refer to Managing Python Runtime and Libraries. Also, config files for Python steps need to point to python_env.tar.gz using the EXECUTION_ENV_PATH directive.
The model folder then needs to be TAR compressed and renamed using model_version.txt. Finally, the resulting <model_name>_<model_version>.tar.gz file is copied to the Amazon Simple Storage Service (Amazon S3) bucket connected to the MME, allowing SageMaker to detect and serve the model.
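The packaging step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Veriff's actual tooling: it assumes that model_version.txt sits at the top of the model folder and contains only the version string, and that the archive keeps the model folder as its top-level entry. The function name `package_model` is hypothetical.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, output_dir):
    """Compress a model folder into <model_name>_<model_version>.tar.gz.

    Assumption: model_version.txt lives directly inside model_dir and holds
    nothing but the version string.
    """
    model_dir = Path(model_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    version = (model_dir / "model_version.txt").read_text().strip()
    archive = output_dir / f"{model_dir.name}_{version}.tar.gz"

    # Keep the model folder itself as the top-level entry of the archive.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(model_dir, arcname=model_dir.name)
    return archive
```

The returned path could then be copied to the bucket behind the MME, for example with the boto3 S3 client's `upload_file(Filename, Bucket, Key)` method.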
As the previous section made apparent, building a Triton model repository is straightforward. However, running all the steps needed to deploy it is tedious and error-prone when done manually. To overcome this, Veriff built a monorepo containing all models to be deployed to MMEs, where data scientists collaborate in a Gitflow-like approach. This monorepo has the following features:
This monorepo is integrated with a continuous integration (CI) tool. For every new push to the repo or new model, the following steps are run:
<model_name>_<model_version>.tar.gz).
These steps make sure that models have the quality required for deployment, so for every push to a repo branch, the resulting TAR file is copied (in another CI step) to the staging S3 bucket. When pushes are done in the main branch, the model file is copied to the production S3 bucket. The following diagram depicts this CI/CD system.
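The branch-based promotion rule can be expressed as a small routing function. The sketch below is a hypothetical illustration, not Veriff's CI configuration: the function names, the bucket names used in the test, and the default branch name `main` are assumptions. It also assumes that pushes to the main branch go to production only.

```python
def destination_bucket(branch, staging_bucket, production_bucket, main_branch="main"):
    """Pick the S3 bucket for a model archive based on the pushed branch.

    Assumption: main-branch pushes go to production, every other branch
    goes to staging.
    """
    return production_bucket if branch == main_branch else staging_bucket


def destination_uri(branch, archive_name, staging_bucket, production_bucket):
    """Return the full S3 URI the CI copy step would write to."""
    bucket = destination_bucket(branch, staging_bucket, production_bucket)
    return f"s3://{bucket}/{archive_name}"
```

Because each MME reads from its own bucket, choosing the bucket is all that is needed to decide whether a model is served from staging or production.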
Using MMEs allows Veriff to use a monorepo approach to deploy models to production. In summary, Veriff’s new model deployment workflow consists of the following steps:
With this new solution in place, deploying a model at Veriff is a straightforward part of the development process. New model development time has decreased from 10 days to an average of 2 days.
The managed infrastructure provisioning and auto scaling features of SageMaker brought Veriff added benefits. They used the InvocationsPerInstance CloudWatch metric to scale according to traffic patterns, saving on costs without sacrificing reliability. To define the threshold value for the metric, they performed load testing on the staging endpoint to find the best trade-off between latency and cost.
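As a rough illustration of such a policy, the following Python sketch builds the two requests that Application Auto Scaling expects for a SageMaker endpoint variant: one to register the variant as a scalable target, and one to attach a target tracking policy on the predefined `SageMakerVariantInvocationsPerInstance` metric. The endpoint name, variant name, capacity limits, and target value are hypothetical. The post states only that the threshold came from load testing.

```python
def build_scaling_requests(endpoint_name, variant_name, target_invocations,
                           min_instances=1, max_instances=4):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-per-instance",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Threshold found through load testing; the value is up to the caller.
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_scaling(autoscaling_client, target, policy):
    """Register the variant, then attach the target tracking policy."""
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)
```

Here `autoscaling_client` would be `boto3.client("application-autoscaling")`. A lower target value scales out earlier and favors latency, while a higher one favors cost.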
After deploying seven production models to MMEs and analyzing spend, Veriff reported a 75% cost reduction in GPU model serving as compared to the original Kubernetes-based solution. Operational costs were reduced as well, because the burden of provisioning instances manually was lifted from the company’s DevOps engineers.
In this post, we reviewed why Veriff chose SageMaker MMEs over self-managed model deployment on Kubernetes. SageMaker takes on the undifferentiated heavy lifting, allowing Veriff to decrease model development time, increase engineering efficiency, and dramatically lower the cost for real-time inference while maintaining the performance needed for their business-critical operations. Finally, we showcased Veriff’s simple yet effective model deployment CI/CD pipeline and model versioning mechanism, which can be used as a reference implementation of combining software development best practices and SageMaker MMEs. You can find code samples on hosting multiple models using SageMaker MMEs on GitHub.