
Speculative Streaming: Fast LLM Inference Without Auxiliary Models

This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) workshop at NeurIPS 2024.
Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method…
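For context on what the paper's single-model approach replaces, here is a minimal sketch of the classic two-model speculative decoding loop: a cheap draft model proposes a few tokens, and the target model verifies them with the standard accept/reject rule (keep a drafted token x with probability min(1, p(x)/q(x)); on rejection, resample from the residual max(0, p − q)). The names here (`draft_dist`, `target_dist`, `speculative_step`) and the toy distributions are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_dist(context):
    # Toy stand-in for the cheap draft model: any function returning a
    # next-token distribution over the vocabulary would do here.
    logits = np.cos(np.arange(VOCAB) + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_dist(context):
    # Toy stand-in for the slower, more accurate target model.
    logits = np.cos(0.9 * np.arange(VOCAB) + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    A drafted token x is kept with probability min(1, p(x) / q(x)); on
    rejection we resample from the residual distribution max(0, p - q).
    This preserves the target model's output distribution exactly.
    """
    # Phase 1: draft k tokens autoregressively with the cheap model.
    drafted, q_probs = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_dist(ctx)
        x = rng.choice(VOCAB, p=q)
        drafted.append(x)
        q_probs.append(q)
        ctx.append(x)

    # Phase 2: verify drafts left to right with the target model.
    accepted = list(context)
    for x, q in zip(drafted, q_probs):
        p = target_dist(accepted)
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)  # draft survives verification
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            break  # everything after a rejected draft is discarded
    else:
        # All k drafts accepted: the target model grants one bonus token.
        p = target_dist(accepted)
        accepted.append(rng.choice(VOCAB, p=p))
    return accepted

print(speculative_step([1, 2, 3], k=4))
```

In a real system the target model scores all k drafted positions in a single batched forward pass, which is where the speedup comes from; the per-position calls above are for readability only. Speculative Streaming removes the separate draft model entirely, folding drafting into the target model itself.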
