Categories: FAANG

Speculative Streaming: Fast LLM Inference Without Auxiliary Models

This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) workshop at NeurIPS 2024.
Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method…

Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 2

July 10, 2024

In "FAANG"

Amazon SageMaker AI introduces EAGLE based adaptive speculative decoding to accelerate generative AI inference

November 26, 2025

In "FAANG"

Together AI’s ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time

Enterprises expanding AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.Speculators are smaller AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then verifies in parallel. This technique (called speculative…

October 11, 2025

In "AI/ML News"

AI Generated Robotic Content