Across all industries, machine learning (ML) models are getting deeper, workflows are getting more complex, and workloads are operating at larger scales. Significant effort and resources are put into making these models more accurate since this investment directly results in better products and experiences. On the other hand, making these models run efficiently in production is a non-trivial undertaking that’s often overlooked, despite being key to achieving performance and budget goals. In this post we cover how Exafunction and AWS Inferentia work together to unlock easy and cost-efficient deployment for ML models in production.
Exafunction is a start-up focused on enabling companies to perform ML at scale as efficiently as possible. One of their products is ExaDeploy, an easy-to-use SaaS solution to serve ML workloads at scale. ExaDeploy efficiently orchestrates your ML workloads across mixed resources (CPU and hardware accelerators) to maximize resource utilization. It also takes care of auto scaling, compute colocation, network issues, fault tolerance, and more, to ensure efficient and reliable deployment. AWS Inferentia-based Amazon EC2 Inf1 instances are purpose-built to deliver the lowest cost-per-inference in the cloud. ExaDeploy now supports Inf1 instances, which allows users to get both the hardware-based savings of accelerators and the software-based savings of optimized resource virtualization and orchestration at scale.
To ensure efficient utilization of compute resources, you need to consider proper resource allocation, auto scaling, compute co-location, network cost and latency management, fault tolerance, versioning and reproducibility, and more. At scale, any inefficiencies materially affect costs and latency, and many large companies have addressed these inefficiencies by building internal teams and expertise. However, it’s not practical for most companies to assume this financial and organizational overhead of building generalizable software that isn’t the company’s desired core competency.
ExaDeploy is designed to solve these deployment efficiency pain points, including those seen in some of the most complex workloads, such as autonomous vehicle and natural language processing (NLP) applications. On some large batch ML workloads, ExaDeploy has reduced costs by over 85% without sacrificing latency or accuracy, with integration time as low as one engineer-day. ExaDeploy has been proven to auto scale and manage thousands of simultaneous hardware accelerator resource instances without any system degradation.
Key features of ExaDeploy include:
AWS Inferentia-based Amazon EC2 Inf1 instances are designed specifically for deep learning inference workloads. These instances provide up to 2.3x higher throughput and up to 70% cost savings compared to the current generation of GPU inference instances.
ExaDeploy now supports AWS Inferentia, and together they unlock the increased performance and cost savings achieved through purpose-built hardware acceleration and optimized resource orchestration at scale. Let’s look at the combined benefits of ExaDeploy and AWS Inferentia by considering a very common type of modern ML workload: batched, mixed-compute workloads.
Hypothetical workload characteristics:
The table below shows how each of the options shapes up:
| Setup | Resources needed | Cost | Latency |
| --- | --- | --- | --- |
| GPU without ExaDeploy | 2 CPU, 2 GPU per client (total 20 CPU, 20 GPU) | 100 | 30 ms |
| GPU with ExaDeploy | 8 GPUs shared across 10 clients, 1 CPU per client | 42 | 30 ms |
| AWS Inferentia without ExaDeploy | 1 CPU, 1 AWS Inferentia per client (total 10 CPU, 10 Inferentia) | 30 | 20 ms |
| AWS Inferentia with ExaDeploy | 3 AWS Inferentia shared across 10 clients, 1 CPU per client | 16 | 20 ms |
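To make the comparison concrete, the Cost column can be turned into percentage savings against any chosen baseline. The following is a minimal sketch, not part of ExaDeploy: the cost figures are copied from the table as given (the table does not state their units, which do not matter for a ratio), and the helper name `savings_vs_baseline` is made up for illustration.

```python
# Cost figures copied from the table above; units are whatever the table uses.
COSTS = {
    "GPU without ExaDeploy": 100,
    "GPU with ExaDeploy": 42,
    "AWS Inferentia without ExaDeploy": 30,
    "AWS Inferentia with ExaDeploy": 16,
}


def savings_vs_baseline(costs, baseline):
    """Return the percentage cost reduction of every setup relative to `baseline`."""
    base = costs[baseline]
    return {
        name: round(100 * (base - cost) / base, 1)
        for name, cost in costs.items()
    }


# Savings relative to the plain GPU deployment.
vs_gpu = savings_vs_baseline(COSTS, "GPU without ExaDeploy")
# Savings from adding ExaDeploy on top of an Inferentia-only deployment.
vs_inferentia = savings_vs_baseline(COSTS, "AWS Inferentia without ExaDeploy")

for name, pct in vs_gpu.items():
    print(f"{name}: {pct}% lower cost than GPU without ExaDeploy")
```

Running the helper against a second baseline, as with `vs_inferentia`, separates the saving that comes from the hardware change from the saving that comes from sharing accelerators across clients.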
In this section, we go over the steps to configure ExaDeploy through an example with inf1 nodes on a BERT PyTorch model. We saw an average throughput of 1140 samples/sec for the bert-base model, which demonstrates that little to no overhead was introduced by ExaDeploy for this single-model, single-workload scenario.
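The post does not describe how the throughput figure was measured. As a rough illustration of what a samples/sec number means, the sketch below times repeated calls to a batch function after a short warm-up. Everything here is an assumption for illustration: `measure_throughput` and `fake_inference` are hypothetical names, and the sleep stands in for a real model call.

```python
import time


def measure_throughput(run_batch, batch_size, num_batches, warmup_batches=2):
    """Return samples/sec for `run_batch`, a zero-argument callable that
    processes one batch of `batch_size` samples."""
    # Warm-up calls are excluded from timing so one-off setup costs
    # (lazy initialization, caches) do not skew the result.
    for _ in range(warmup_batches):
        run_batch()

    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.perf_counter() - start

    return batch_size * num_batches / elapsed


def fake_inference():
    # Hypothetical stand-in for a real model call; sleeps roughly 1 ms per batch.
    time.sleep(0.001)


throughput = measure_throughput(fake_inference, batch_size=8, num_batches=50)
print(f"{throughput:.0f} samples/sec")
```

Measuring the same callable with and without a serving layer in front of it is one way to estimate the overhead that layer adds.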
Step 1: Set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster
An Amazon EKS cluster can be brought up with our Terraform AWS module. For our example, we used an inf1.xlarge instance for AWS Inferentia.
Step 2: Set up ExaDeploy
The second step is to set up ExaDeploy. In general, the deployment of ExaDeploy on inf1 instances is straightforward. Setup mostly follows the same procedure as it does on graphics processing unit (GPU) instances. The primary difference is to change the model tag from GPU to AWS Inferentia and recompile the model. For example, moving from g4dn to inf1 instances using ExaDeploy’s application programming interfaces (APIs) required only approximately 10 lines of code to be changed.
AWS Inferentia is pushing the boundaries of throughput for model inference and delivering the lowest cost-per-inference in the cloud. That being said, companies need the proper orchestration to enjoy the price-performance benefits of Inf1 at scale. ML serving is a complex problem that, if addressed in-house, requires expertise that’s removed from company goals and often delays product timelines. ExaDeploy, which is Exafunction’s ML deployment software solution, has emerged as the industry leader. It serves even the most complex ML workloads, while providing smooth integration experiences and support from a world-class team. Together, ExaDeploy and AWS Inferentia unlock increased performance and cost savings for inference workloads at scale.
In this post, we showed you how Exafunction supports AWS Inferentia for high-performance ML. For more information on building applications with Exafunction, visit Exafunction. For best practices on building deep learning workloads on Inf1, visit Amazon EC2 Inf1 instances.
Nicholas Jiang, Software Engineer, Exafunction
Jonathan Ma, Software Engineer, Exafunction
Prem Nair, Software Engineer, Exafunction
Anshul Ramachandran, Software Engineer, Exafunction
Shruti Koparkar, Sr. Product Marketing Manager, AWS