One of the most popular models available today is XGBoost. With the ability to solve various problems such as classification and regression, XGBoost has become a popular option that also falls into the category of tree-based models. In this post, we dive deep to see how Amazon SageMaker can serve these models using NVIDIA Triton Inference Server. Real-time inference workloads can have varying levels of requirements and service level agreements (SLAs) in terms of latency and throughput, and can be met using SageMaker real-time endpoints.
SageMaker provides single model endpoints, which allow you to deploy a single machine learning (ML) model against a logical endpoint. For other use cases, you can choose to manage cost and performance using multi-model endpoints, which allow you to specify multiple models to host behind a logical endpoint. Regardless of the option the you choose, SageMaker endpoints allow a scalable mechanism for even the most demanding enterprise customers while providing value in a plethora of features, including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).
Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the Forest Inference Library (FIL) backend, which is supported by Triton on SageMaker, so that you can make an informed decision for your workloads and get the best performance and cost optimization possible.
Triton supports the FIL backend to serve tree models, such as XGBoost, LightGBM, scikit-learn Random Forest, RAPIDS cuML Random Forest, and any other model supported by Treelite. These models have long been used for solving problems such as classification or regression. Although these types of models have traditionally run on CPUs, the popularity of these models and inference demands have led to various techniques to increase inference performance. The FIL backend utilizes many of these techniques by using cuML constructs and is built on C++ and the CUDA core library to optimize inference performance on GPU accelerators.
The FIL backend uses cuML’s libraries to use CPU or GPU cores to accelerate learning. In order to use these processors, data is referenced from host memory (for example, NumPy arrays) or GPU arrays (uDF, Numba, cuPY, or any library that supports the __cuda_array_interface__
) API. After the data is staged in memory, the FIL backend can run processing across all the available CPU or GPU cores.
The FIL backend threads can communicate with each other without utilizing shared memory of the host, but in ensemble workloads, host memory should be considered. The following diagram shows an ensemble scheduler runtime architecture where you have the ability to fine-tune the memory areas, including CPU addressable shared memory that is used for inter-process communication between Triton (C++) and the Python process (Python backend) for exchanging tensors (input/output) with the FIL backend.
Triton Inference Server provides configurable options for developers to tune their workloads and optimize model performance. The configuration dynamic_batching
allows Triton to hold client-side requests and batch them on the server side in order to efficiently use FIL’s parallel computation to inference the entire batch together. The option max_queue_delay_microseconds
offers a fail-safe control of how long Triton waits to form a batch.
There are a number of other FIL-specific options available that impact performance and behavior. We suggest starting with storage_type
. When running the backend on GPU, FIL creates a new memory/data structure that is a representation of the tree for which FIL can impact performance and footprint. This is configurable via the environment parameter storage_type
, which has the options dense, sparse, and auto. Choosing the dense option will consume more GPU memory and doesn’t always result in better performance, so it’s best to check. In contrast, the sparse option will consume less GPU memory and can possibly perform as well or better than dense. Choosing auto will cause the model to default to dense unless doing so will consume significantly more GPU memory than sparse.
When it comes to model performance, you might consider emphasizing the threads_per_tree
option. One thing that you may overserve in real-world scenarios is that threads_per_tree
can have a bigger impact on throughput than any other parameter. Setting it to any power of 2 from 1–32 is legitimate. The optimal value is hard to predict for this parameter, but when the server is expected to deal with higher load or process larger batch sizes, it tends to benefit from a larger value than when it’s processing a few rows at a time.
Another parameter to be aware of is algo
, which is also available if you’re running on GPU. This parameter determines the algorithm that’s used to process the inference requests. The options supported for this are ALGO_AUTO
, NAIVE
, TREE_REORG
, and BATCH_TREE_REORG
. These options determine how nodes within a tree are organized and can also result in performance gains. The ALGO_AUTO
option defaults to NAIVE
for sparse storage and BATCH_TREE_REORG
for dense storage.
Lastly, FIL comes with Shapley explainer, which can be activated by using the treeshap_output
parameter. However, you should keep in mind that Shapley outputs hurt performance due to its output size.
There is currently no standard file format to store forest-based models; every framework tends to define its own format. In order to support multiple input file formats, FIL imports data using the open-source Treelite library. This enables FIL to support models trained in popular frameworks, such as XGBoost and LightGBM. Note that the format of the model that you’re providing must be set in the model_type
configuration value specified in the config.pbtxt
file.
Each model in a model repository must include a model configuration that provides the required and optional information about the model. Typically, this configuration is provided in a config.pbtxt
file specified as ModelConfig protobuf. To learn more about the config settings, refer to Model Configuration. The following are some of the model configuration parameters:
predict_proba
option is set to true, then a probability value will be returned for each class. Otherwise, a single value will be returned, indicating the class predicted for the given sample.xgboost_json
in this example, but xgboost
, lightgbm
, and tl_checkpoint
are valid formats as well).output_class
is set to true, this must be provided, although it won’t be used if predict_proba
is also set to true.SageMaker allows you to deploy both single model and multi-model endpoints with NVIDIA Triton Inference Server. The following figure shows the Triton Inference Server high-level architecture. The model repository is a file system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server and are routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then returned.
When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance
as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, depending on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to Available Deep Learning Containers Images.
ML applications are complex and can often require data preprocessing. In this notebook, we dive into how to deploy a tree-based ML model like XGBoost using the FIL backend in Triton on a SageMaker multi-model endpoint. We also cover how to implement a Python-based data preprocessing inference pipeline for your model using the ensemble feature in Triton. This will allow us to send in the raw data from the client side and have both data preprocessing and model inference happen in a Triton SageMaker endpoint for optimal inference performance.
Triton Inference Server greatly simplifies the deployment of AI models at scale in production. Triton Inference Server comes with a convenient solution that simplifies building preprocessing and postprocessing pipelines. The Triton Inference Server platform provides the ensemble scheduler, which is responsible for pipelining models participating in the inference process while ensuring efficiency and optimizing throughput. Using ensemble models can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton.
In this notebook, we show how to use the ensemble feature for building a pipeline of data preprocessing with XGBoost model inference, and you can extrapolate from it to add custom postprocessing to the pipeline.
We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that will give SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. See the following code:
The Python backend in Triton requires us to use a Conda environment for any additional dependencies. In this case, we use the Python backend to preprocess the raw data before feeding it into the XGBoost model that is running in the FIL backend. Even though we originally used RAPIDS cuDF and cuML to do the data preprocessing, here we use Pandas and scikit-learn as preprocessing dependencies during inference. We do this for three reasons:
LabelEncoders
created in cuML can be used in scikit-learn and vice versa.We follow the instructions from the Triton documentation for packaging preprocessing dependencies (scikit-learn and Pandas) to be used in the Python backend as a Conda environment TAR file. The bash script create_prep_env.sh creates the Conda environment TAR file, then we move it into the preprocessing model directory. See the following code:
After we run the preceding script, it generates preprocessing_env.tar.gz
, which we copy to the preprocessing directory:
For preprocessing, we use Triton’s Python backend to perform tabular data preprocessing (categorical encoding) during inference for raw data requests coming into the server. For more information about the preprocessing that was done during training, refer to the training notebook.
The Python backend enables preprocessing, postprocessing, and any other custom logic to be implemented in Python and served with Triton. Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. We have already set up a model for Python data preprocessing called preprocessing in cpu_model_repository
and gpu_model_repository
.
Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. The value 1 represents version 1 of our Python preprocessing model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifact required by that backend. For this example, we use the Python backend, which requires the Python file you’re serving to be called model.py, and the file needs to implement certain functions. If we were using a PyTorch backend, a model.pt file would be required, and so on. For more details on naming conventions for model files, refer to Model Files.
The model.py Python file we use here implements all the tabular data preprocessing logic to convert raw data into features that can be fed into our XGBoost model.
Every Triton model must also provide a config.pbtxt
file describing the model configuration. To learn more about the config settings, refer to Model Configuration. Our config.pbtxt file specifies the backend as python and all the input columns for raw data along with preprocessed output, which consists of 15 features. We also specify we want to run this Python preprocessing model on the CPU. See the following code:
Next, we set up the model directory for a tree-based ML model like XGBoost, which will be using the FIL backend.
The expected layout for cpu_memory_repository
and gpu_memory_repository
are similar to the one we showed earlier.
Here, FIL
is the name of the model. We can give it a different name like xgboost
if we want to. 1
is the version subdirectory, which contains the model artifact. In this case, it’s the xgboost.json
model that we saved. Let’s create this expected layout:
We need to have the configuration file config.pbtxt
describing the model configuration for the tree-based ML model, so that the FIL backend in Triton can understand how to serve it. For more information, refer to the latest generic Triton configuration options and the configuration options specific to the FIL backend. We focus on just a few of the most common and relevant options in this example.
Create config.pbtxt
for model_cpu_repository
:
Similarly, set up config.pbtxt
for model_gpu_repository
(note the difference is USE_GPU = True
):
Now we’re ready to set up the inference pipeline for data preprocessing and tree-based model inference using an ensemble model. An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. Here we use the ensemble model to build a pipeline of data preprocessing in the Python backend followed by XGBoost in the FIL backend.
The expected layout for the ensemble
model directory is similar to the ones we showed previously:
We created the ensemble model’s config.pbtxt following the guidance in Ensemble Models. Importantly, we need to set up the ensemble scheduler in config.pbtxt
, which specifies the data flow between models within the ensemble. The ensemble scheduler collects the output tensors in each step, and provides them as input tensors for other steps according to the specification.
Finally, we end up with the following model repository directory structure, containing a Python preprocessing model and its dependencies along with the XGBoost FIL model and the model ensemble.
We package the directory and its contents as model.tar.gz
for uploading to Amazon Simple Storage Service (Amazon S3). We have two options in this example: using a CPU-based instance or a GPU-based instance. A GPU-based instance is more suitable when you need higher processing power and want to use CUDA cores.
Create and upload the model package for a CPU-based instance (optimized for CPU) with the following code:
Create and upload the model package for a GPU-based instance (optimized for GPU) with the following code:
We now have the model artifacts stored in an S3 bucket. In this step, we can also provide the additional environment variable SAGEMAKER_TRITON_DEFAULT_MODEL_NAME
, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to Amazon S3. This variable is optional in the case of a single model. In the case of ensemble models, this key has to be specified for Triton to start up in SageMaker.
Additionally, you can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT
and SAGEMAKER_TRITON_THREAD_COUNT
for optimizing the thread counts.
We use the preceding model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint
We use this endpoint configuration to create a SageMaker endpoint and wait for the deployment to finish. With SageMaker MMEs, we have the option to host multiple ensemble models by repeating this process, but we stick with one deployment for this example:
The status will change to InService
when the deployment is successful.
After the endpoint is running, we can use some sample raw data to perform inference using JSON as the payload format. For the inference request format, Triton uses the KFServing
community standard inference protocols. See the following code:
The notebook referred in the blog can be found in the GitHub repository.
In addition to the options to fine-tune the settings of the FIL backend we mentioned earlier, data scientists can also ensure that the input data for the backend is optimized for processing by the engine. Whenever possible, input data in row-major format into the GPU array. Other formats will require internal conversion and take up cycles, decreasing performance.
Due to the way FIL data structures are maintained in GPU memory, be mindful of the tree depth. The deeper the tree depth, the larger your GPU memory footprint will be.
Use the instance_group_count
parameter to add worker processes and increase the throughput of the FIL backend, which will result in larger CPU and GPU memory consumption. In addition, consider SageMaker-specific variables that are available to increase the throughput, such as HTTP threads, HTTP buffer size, batch size, and max delay.
In this post, we dove deep into the FIL backend that Triton Inference Server supports on SageMaker. This backend provides for both CPU and GPU acceleration of your tree-based models such as the popular XGBoost algorithm. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use this capability with single and multi-model endpoints to balance of performance and cost savings.
We encourage you to take the information in this post and see if SageMaker can meet your hosting needs to serve tree-based models, meeting your requirements for cost reduction and workload performance.
The notebook referenced in this post can be found in the SageMaker examples GitHub repository. Furthermore, you can find the latest documentation on the FIL backend on GitHub.
The report The economic potential of generative AI: The next productivity frontier, published by McKinsey…
The new model shows open-source closing in on closed-source models, suggesting reduced chances of one…
Samsung’s celebrated flagship soundbar does just enough to beat out the rest of its Dolby…
Even highly realistic androids can cause unease when their facial expressions lack emotional consistency. Traditionally,…
These beard tools deliver a quality trim for all types of facial hair.
Artificial intelligence (AI) research, particularly in the machine learning (ML) domain, continues to increase the…