Latency ML12493
This is a guest blog post co-written with Minghui Yu and Jianzhe Xiao from Bytedance.
ByteDance is a technology company that operates a range of content platforms to inform, educate, entertain, and inspire people across languages, cultures, and geographies. Users trust and enjoy our content platforms because of the rich, intuitive, and safe experiences they provide. These experiences are made possible by our machine learning (ML) backend engine, with ML models built for content moderation, search, recommendation, advertising, and novel visual effects.
The ByteDance AML (Applied Machine Learning) team provides highly performant, reliable, and scalable ML systems and end-to-end ML services for the company’s business. We were researching ways to optimize our ML inference systems to reduce costs, without increasing response times. When AWS launched AWS Inferentia, a high-performance ML inference chip purpose-built by AWS, we engaged with our AWS account team to test if AWS Inferentia can address our optimization goals. We ran several proofs of concept, resulting in up to 60% lower inference cost compared to T4 GPU-based EC2 G4dn instances and up to 25% lower inference latency. To realize these cost savings and performance improvements, we decided to deploy models on AWS Inferentia-based Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances in production.
The following chart shows the latency improvement for one of our face detection models that was previously deployed on GPUs with Tensor RT. The average latency decreased by 20% (from 50 milliseconds to 40 milliseconds), and the p99 latency decreased by 25% (from 200 milliseconds to 150 milliseconds).
In this post, we share how we saved on inference costs while reducing latencies and increasing throughput using AWS Inferentia.
The ByteDance AML team focuses on the research and implementation of cutting-edge ML systems and the heterogenous computing resources they require. We create large-scale training and inference systems for a wide variety of recommender, natural language processing (NLP), and computer vision (CV) models. These models are highly complex and process a huge amount of data from the many content platforms ByteDance operates. Deploying these models requires significant GPU resources, whether in the cloud or on premises. Therefore, the compute costs for these inference systems are quite high.
We were looking to lower these costs without impacting throughput or latency. We wanted the cloud’s flexibility and faster delivery cycle, which is much shorter than the one needed for an on-premises setup. And although we were open to exploring new options for accelerated ML, we also wanted a seamless developer experience.
We learned from our AWS team that AWS Inferentia-based EC2 Inf1 instances deliver high-performance ML inference at the lowest cost-per-inference in the cloud. We were curious to explore them and found them to be well-suited to our use case, because we run substantial machine learning on large amounts of image, object, speech, and text data. They were definitely a good fit for our goals, because we could realize huge cost savings given the complexity of our models and volume of daily predictions. Furthermore, AWS Inferentia features a large amount of on-chip memory, which you can use for caching large models instead of storing them off chip. We recognized that this can have a significant impact in reducing inference latency because the processing cores of AWS Inferentia, called NeuronCores, have high-speed access to models that are stored in on-chip memory and aren’t limited by the off-chip memory bandwidth.
Ultimately, after evaluating several options, we chose EC2 Inf1 instances for their better performance/price ratio compared to G4dn instances and NVIDIA T4 on premises. We engaged in a cycle of continuous iteration with the AWS team to unlock the price and performance benefits of Inf1.
Getting started with AWS Inferentia using the AWS Neuron SDK involved two phases: compilation of model code and deployment on Inf1 instances. As is common when moving ML models to any new infrastructure, there were some challenges that we faced. We were able to overcome these challenges with diligence and support from our AWS team. In the following sections, we share several useful tips and observations based on our experience deploying inference workloads on AWS Inferentia.
Our optical character recognition (OCR) conformer model detects and reads text within images. We worked on several optimizations to get high performance (QPS) for a variety of batch sizes, while keeping the latency low. Some key optimizations are noted below:
We created two new model variants to optimize our deployment on Inferentia:
To further improve performance, we made the following optimizations to reduce memory usage or improve access efficiency:
ResNet-50 is a pre-trained deep learning model for image classification. It is a Convolutional Neural Network (CNN or ConvNet) that is most commonly applied to analyzing visual imagery. We used the following techniques to improve this model’s performance on Inferentia:
Our multi-modal deep learning model is a combination of multiple separate models. The size of this model is relatively large, which caused model loading failures on Inferentia. The AWS Neuron team successfully solved this problem by using weight sharing to reduce the device memory usage. The Neuron team released this weight de-duplication feature in the Neuron libnrt library and also improved Neuron Tools for more precise metrics. The runtime weight de-duplication feature can be enabled by setting the following environment variable before running inference:
NEURON_RT_MULTI_INSTANCE_SHARED_WEIGHTS=1
The updated Neuron SDK reduced the overall memory consumption of our duplicated models, which enabled us to deploy our multi-modal model for multi-core inference.
At ByteDance, we continue to deploy innovative deep learning models to deliver delightful user experiences to almost 2 billion monthly active users. Given the massive scale at which we operate, we’re constantly looking for ways to save costs and optimize performance. We will continue to migrate models to AWS Inferentia to benefit from its high performance and cost-efficiency. We also want AWS to launch more AWS Inferentia-based instance types, such as ones with more vCPUs for preprocessing tasks. Going forward, ByteDance is hoping to see more silicon innovation from AWS to deliver the best price performance for ML applications.
If you’re interested in learning more about how AWS Inferentia can help you save costs while optimizing performance for your inference applications, visit the Amazon EC2 Inf1 instances product page.
What models or workflows are people using to generate these? submitted by /u/danikcara [link] [comments]
Ever felt like trying to find a needle in a haystack? That’s part of the…
We propose a distillation scaling law that estimates distilled model performance based on a compute…
How Alberta Health Services is using advanced AI to bolster its defenses as attackers increasingly…
A week after Disney and Universal filed a landmark lawsuit against Midjourney, the generative AI…
Imagine supercomputers that think with light instead of electricity. That s the breakthrough two European…