In the ever-evolving landscape of machine learning, efficient model management is critical. This is especially true in fraud detection, where models have to be redeployed frequently, as they act against human adversaries who can reverse-engineer the fraud model logic and adapt their tactics accordingly.
At Delivery Hero, one of the world's leading local delivery platforms, the Incentive Fraud team is responsible for building ML-powered, rule-based services for detecting and preventing abuse of incentive vouchers. Such vouchers can, for example, be granted to newly registered users to motivate them to use the food delivery platform, so the service must reliably distinguish genuinely new customers from users who create a new account for each order. This task is especially challenging because Delivery Hero operates in 70+ countries, each imposing data protection regulations with different local constraints.
At a high level, the team's setup is a REST API service that implements rule-based logic on top of the decisions provided by a set of interconnected ML models. The service has tight latency requirements because the API is called on every food order request. To meet them, the service and the models run in regional Kubernetes clusters with horizontal autoscaling and other high-availability practices.
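A minimal sketch of this serving pattern is shown below: a FastAPI endpoint that layers a hard business rule on top of a model score. The route name, request fields, and threshold are illustrative assumptions, not Delivery Hero's actual API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class OrderRequest(BaseModel):
    customer_id: str
    voucher_code: str
    country: str

def model_score(order: OrderRequest) -> float:
    """Placeholder for the country-specific fraud model baked into the image."""
    return 0.1

@app.post("/decision")  # called on each order request, hence the tight latency budget
def decide(order: OrderRequest) -> dict:
    score = model_score(order)
    # Rule layer: deterministic business rules refine or override the model output.
    if score > 0.9:
        return {"allow": False, "reason": "high_fraud_score"}
    return {"allow": True, "score": score}
```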
The diagram below shows the high-level model serving architecture:
The team chose Vertex AI as its ML model development environment for its scalability and tight integration with BigQuery (the primary data warehouse at Delivery Hero) and other Google Cloud services. The models are trained in Vertex AI Pipelines, which stores their metadata in Vertex AI Model Registry. Once trained and analyzed, the models are built into a FastAPI Docker image by Cloud Build.
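To make the pipeline structure concrete, below is a condensed sketch of what such a training pipeline can look like in KFP v2, which Vertex AI Pipelines executes. The component bodies and the logged metric are simplified placeholders rather than the team's actual steps.

```python
from kfp import dsl

@dsl.component(base_image="python:3.10")
def train_model(dataset_uri: str, model: dsl.Output[dsl.Model]):
    # Placeholder: fit the fraud model on the given data slice and write it
    # to model.path so Vertex AI tracks it as an output artifact.
    with open(model.path, "w") as f:
        f.write(f"model trained on {dataset_uri}")

@dsl.component(base_image="python:3.10")
def evaluate_model(model: dsl.Input[dsl.Model], metrics: dsl.Output[dsl.Metrics]):
    # Placeholder metric; logged metrics surface in Vertex ML Metadata.
    metrics.log_metric("auc", 0.9)

@dsl.pipeline(name="incentive-fraud-training")
def training_pipeline(dataset_uri: str):
    train_task = train_model(dataset_uri=dataset_uri)
    evaluate_model(model=train_task.outputs["model"])
```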
To enable fast iteration on model development, all workflows were deeply integrated with GitHub Actions CI/CD, allowing users to train models and build images while following software engineering and MLOps best practices:
One of the reasons for implementing continuous training (CT) for the models was the need to develop and maintain multiple models that share the same code base but are trained on different subsets of data. A typical example within Delivery Hero is maintaining tens of models, one per country, and deploying them in a regional cluster (EMEA, APAC, etc.). The decision on each model therefore has to be made individually; however, development, evaluation, and sometimes deployment iterations are shared across the models.
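Under these constraints, the fan-out can be as simple as submitting one run of the shared pipeline per country. The sketch below assumes the public Vertex AI SDK; the project, region, country list, and parameter names are illustrative.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west1")  # hypothetical

COUNTRIES = ["DE", "SG", "AR"]  # illustrative subset of the per-country models

for country in COUNTRIES:
    job = aiplatform.PipelineJob(
        display_name=f"incentive-fraud-{country}",
        template_path="training_pipeline.json",   # the shared, compiled pipeline
        parameter_values={"dataset_uri": f"bq://project.dataset.orders_{country}"},
    )
    job.submit()  # asynchronous, so the CI job can fan out all countries at once
```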
For the Incentive Fraud use case, the team implemented such MLOps workflows using the GCP Vertex AI Model Registry as the backend, together with ml-utils, an internal Python package developed by the team. The package provides a single CLI and Python API to link together the entities of Kubeflow Pipelines (used internally by GCP Vertex AI Pipelines): Pipelines (or Pipeline Runs), Experiments (groupings of pipelines), and Models and Datasets (output artifacts of the pipelines). Internally, ml-utils loads the large JSON definitions of the Kubeflow pipeline runs, finds the required artifacts, and downloads them in a predefined format. More importantly, it provides an abstraction layer over the models that enforces naming conventions, and it can query Vertex ML Metadata to search for models by wildcards.
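Since ml-utils is an internal package, the snippet below is only an assumed sketch of the kind of query it wraps: listing registered models and matching them against a naming-convention wildcard via the public Vertex AI SDK. The naming pattern and project settings are hypothetical.

```python
import fnmatch
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west1")  # hypothetical

def find_models(pattern: str) -> list[aiplatform.Model]:
    """Return registered models whose display name matches a wildcard,
    e.g. 'incentive-fraud-*' for all per-country models."""
    return [m for m in aiplatform.Model.list() if fnmatch.fnmatch(m.display_name, pattern)]

for model in find_models("incentive-fraud-*"):
    print(model.display_name, model.version_id, model.version_aliases)
```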
The picture below illustrates the described CI/CD workflow based on Vertex AI, using the custom ml-utils tool:
As shown in the picture above, each commit a user pushes to the GitHub repo triggers a set of training pipelines in Vertex AI, one for each country. Some of the pipelines might fail (marked red), and some succeed (green).
Steps of the Vertex AI training pipeline (see screenshot below):
All the artifacts of the pipeline (data slices, models, etc.) are automatically saved to GCS. Once all the Vertex AI pipelines have succeeded, the GitHub Actions job that triggered them uses ml-utils to query the Vertex AI Model Registry, retrieve the evaluation metrics, and print them as markdown to the GitHub Actions job Summary page for visibility (see picture below). Each git commit is thus linked to a set of Vertex pipelines and to a model quality report, which data scientists and managers use to make decisions after interpreting the models' quality.
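The reporting step itself needs nothing more than appending markdown to the file that GitHub Actions exposes as GITHUB_STEP_SUMMARY. Below is a sketch under that assumption, with illustrative metrics standing in for those retrieved through ml-utils.

```python
import os

metrics_by_country = {  # illustrative values; in practice fetched via ml-utils
    "DE": {"auc": 0.93, "precision": 0.88},
    "SG": {"auc": 0.91, "precision": 0.85},
}

rows = ["| country | AUC | precision |", "|---|---|---|"]
for country, m in metrics_by_country.items():
    rows.append(f"| {country} | {m['auc']:.3f} | {m['precision']:.3f} |")

# GitHub renders whatever is appended to this file on the job's Summary page.
with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as f:
    f.write("\n".join(rows) + "\n")
```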
Once the team is ready to redeploy some of the models, they create a PR that modifies the serving image config, which defines the slice of models to be deployed to production (the blue dashed line in Vertex Model Registry in the diagram above). This PR triggers another GitHub Actions workflow, which submits a Cloud Build workflow that loads the specified model pickles, builds the FastAPI server image with the models baked into it, runs integration tests, and updates the models' aliases (adding the alias "image-{imagetag}").
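The alias update can be expressed with the public Vertex AI SDK as sketched below; the model ID, version, and tag are placeholders, and the team's actual logic lives inside the Cloud Build workflow.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west1")  # hypothetical

def tag_model_version(model_id: str, version: str, image_tag: str) -> None:
    """Record which serving image a model version was baked into."""
    registry = aiplatform.models.ModelRegistry(model=model_id)
    registry.add_version_aliases(new_aliases=[f"image-{image_tag}"], version=version)

tag_model_version("incentive-fraud-DE", "3", "2024-05-01-abcdef")  # placeholder values
```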
Below is a screenshot of the GitHub Actions model training Summary page for one of the models:
For the Incentive Fraud projects, the team used two environments (two GCP projects):
To facilitate work with the two projects, the team implemented five high-level rules:
These rules give the team a clean linear history of the main branch, where each commit that changes model code, dataset version, or configuration builds a set of per-country release-candidate models with the expected quality metrics:
As a result, the described MLOps setup allowed the team to meet its KPIs, drastically reducing the model server release time from days to one hour by automating the whole process and applying software engineering and MLOps best practices.
Specifically, the team was able to:
Achievement highlights:
If you want to learn more about how Delivery Hero built a BigQuery powered data mesh, achieving data democratization across the company and accelerating data-driven solutions, read the case study.