Categories: FAANG

Microsoft Bing Speeds Ad Delivery With NVIDIA Triton

Jiusheng Chen’s team just got accelerated.

They’re delivering personalized ads to users of Microsoft Bing with 7x throughput at reduced cost, thanks to NVIDIA Triton Inference Server running on NVIDIA A100 Tensor Core GPUs.

It’s an amazing achievement for the principal software engineering manager and his crew.

Tuning a Complex System

Bing’s ad service uses hundreds of models that are constantly evolving. Each must respond to a request within as little as 10 milliseconds, about 10x faster than the blink of an eye.

The latest speedup got its start with two innovations the team delivered to make AI models run faster: Bang and EL-Attention.

Together, they apply sophisticated techniques to do more work in less time with less computer memory. Model training was based on Azure Machine Learning for efficiency.

Flying With NVIDIA A100 MIG

Next, the team upgraded the ad service from NVIDIA T4 to A100 GPUs.

The latter’s Multi-Instance GPU (MIG) feature lets users split one GPU into several instances.

Chen’s team maxed out the MIG feature, transforming one physical A100 into seven independent ones. That let the team reap a 7x throughput per GPU with inference response in 10ms.

Flexible, Easy, Open Software

Triton enabled the shift, in part, because it lets users simultaneously run different runtime software, frameworks and AI modes on isolated instances of a single GPU.

The inference software comes in a software container, so it’s easy to deploy. And open-source Triton — also available with enterprise-grade security and support through NVIDIA AI Enterprise — is backed by a community that makes the software better over time.

Accelerating Bing’s ad system with Triton on A100 GPUs is one example of what Chen likes about his job. He gets to witness breakthroughs with AI.

While the scenarios often change, the team’s goal remains the same — creating a win for its users and advertisers.

AI Generated Robotic Content

Recent Posts

This Is a Weapon of Choice (Wan2.2 Animate)

I used a workflow from here: https://github.com/IAMCCS/comfyui-iamccs-workflows/tree/main Specifically this one: https://github.com/IAMCCS/comfyui-iamccs-workflows/blob/main/C_IAMCCS_NATIVE_WANANIMATE_LONG_VIDEO_v.1.json submitted by /u/sutrik [link]…

18 hours ago

Expert-Level Feature Engineering: Advanced Techniques for High-Stakes Models

Building machine learning models in high-stakes contexts like finance, healthcare, and critical infrastructure often demands…

18 hours ago

Introducing agent-to-agent protocol support in Amazon Bedrock AgentCore Runtime

We recently announced the support for Agent-to-Agent (A2A) protocol on Amazon Bedrock AgentCore Runtime. With…

18 hours ago

BigQuery under the hood: How Google brought embeddings to analytics

Embeddings are a crucial component at the intersection of data and AI. As data structures,…

18 hours ago

Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini

Baidu Inc., China's largest search engine company, released a new artificial intelligence model on Monday…

19 hours ago

The Nike x Hyperice Hyperboot Is $200 Off

Nike’s high-end recovery sneakers are on sale—just in time for ski season.

19 hours ago