Accelerating LLM Inference on NVIDIA GPUs with ReDrafter
Accelerating LLM inference is an important ML research problem: auto-regressive token generation is computationally expensive and relatively slow, so improving inference efficiency directly reduces latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry. Earlier this year, we published and open-sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art…
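To make the underlying idea concrete, below is a minimal sketch of the generic draft-and-verify loop that speculative decoding is built on. This is not ReDrafter itself (ReDrafter uses a recurrent draft head and beam search, among other refinements); it is an illustrative greedy variant, and `draft_model` and `target_model` are placeholder callables assumed to return next-token logits of shape `(batch, seq, vocab)`.

```python
import torch

def speculative_decode(draft_model, target_model, prefix, n_draft=4, max_new=64):
    """Illustrative greedy draft-and-verify loop (not Apple's ReDrafter).

    prefix: LongTensor of shape (1, L) holding the prompt token ids.
    """
    tokens = prefix.clone()
    while tokens.shape[-1] - prefix.shape[-1] < max_new:
        # 1. Draft: the small model proposes n_draft tokens autoregressively.
        draft = tokens.clone()
        for _ in range(n_draft):
            logits = draft_model(draft)            # (1, seq, vocab)
            next_tok = logits[:, -1:].argmax(-1)   # greedy proposal, shape (1, 1)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2. Verify: a single target-model pass scores every proposed
        #    position in parallel instead of one forward pass per token.
        preds = target_model(draft).argmax(-1)     # target's choice at each position

        # 3. Accept the longest prefix of proposals the target agrees with.
        accepted = 0
        for i in range(n_draft):
            pos = tokens.shape[-1] + i
            # logits at index pos-1 predict the token at index pos
            if draft[0, pos] == preds[0, pos - 1]:
                accepted += 1
            else:
                break
        tokens = draft[:, : tokens.shape[-1] + accepted]

        # 4. Always gain at least one token: append the target's own
        #    prediction at the first rejected (or next) position.
        correction = preds[:, tokens.shape[-1] - 1 : tokens.shape[-1]]
        tokens = torch.cat([tokens, correction], dim=-1)
    return tokens
```

Because the target model verifies all drafted tokens in one forward pass, each accepted token costs roughly one cheap draft-model step rather than a full target-model step, which is where the speedup over plain auto-regressive decoding comes from.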