Categories: FAANG

Accelerating LLM Inference on NVIDIA GPUs with ReDrafter

Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state of the art…

Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices

NVIDIA NIM microservices now integrate with Amazon SageMaker, allowing you to deploy industry-leading large language models (LLMs) and optimize model performance and cost. You can deploy state-of-the-art LLMs in minutes instead of days using technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server on NVIDIA accelerated instances…

March 19, 2024

In "FAANG"

By Jove, It’s No Myth: NVIDIA Triton Speeds Inference on Oracle Cloud

An avid cyclist, Thomas Park knows the value of having lots of gears to maintain a smooth, fast ride. So, when the software architect designed an AI inference platform to serve predictions for Oracle Cloud Infrastructure’s (OCI) Vision AI service, he picked NVIDIA Triton Inference Server. That’s because it can…

January 3, 2024

In "FAANG"