Accelerating LLM Inference on NVIDIA GPUs with ReDrafter

Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open-sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state of the art…
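Speculative decoding, the family of techniques ReDrafter belongs to, pairs a small draft model with the large target model: the draft cheaply proposes several tokens ahead, and the target verifies the whole proposal, accepting the longest correct prefix so that each expensive target pass can emit more than one token. The sketch below illustrates generic greedy speculative decoding only, not ReDrafter's recurrent drafter in particular; `speculative_decode`, `draft_model`, `target_model`, and the toy next-token functions in the demo are hypothetical stand-ins for illustration.

```python
# Minimal sketch of generic greedy speculative decoding. Not ReDrafter's
# method; all names here are hypothetical stand-ins. Each "model" maps a
# token sequence to its predicted next token.

def speculative_decode(target_model, draft_model, prompt, num_tokens, k=4):
    """Generate num_tokens tokens after prompt, verifying k-token drafts."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft: the cheap model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Verify: compare each drafted token with the target model's
        #    prediction at that position (a real implementation scores all
        #    k positions with a single batched target forward pass).
        n_accepted, correction = k, None
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if draft[i] != expected:
                n_accepted, correction = i, expected
                break
        tokens.extend(draft[:n_accepted])
        # 3. Guaranteed progress: on a mismatch keep the target's own token;
        #    on full acceptance the same target pass yields one bonus token.
        tokens.append(correction if correction is not None else
                      target_model(tokens))
    return tokens[:len(prompt) + num_tokens]


if __name__ == "__main__":
    def target(toks):   # toy target: count upward by 1
        return toks[-1] + 1

    def draft(toks):    # toy draft: wrong whenever context length % 3 == 0
        return toks[-1] + (1 if len(toks) % 3 else 2)

    print(speculative_decode(target, draft, prompt=[0], num_tokens=10))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because every emitted token matches what greedy decoding with the target model alone would have produced, output quality is unchanged; the speedup comes from amortizing each expensive target pass over several accepted draft tokens.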