Speculative Streaming: Fast LLM Inference Without Auxiliary Models
This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) workshop at NeurIPS 2024. Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target …