Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
This paper was accepted at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.

Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget, leading to inefficient resource utilization. To address this shortcoming, recent advancements in mixture-of-experts (MoE) models, speculative decoding, and early-exit strategies …
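To make the shared idea behind these strategies concrete, here is a minimal early-exit sketch. It is not the paper's Duo-LLM architecture; the layer count, dimensions, and confidence threshold are illustrative assumptions. What it shows is the common mechanism: emit a token from an intermediate layer once a shared prediction head is confident, rather than always spending the full depth of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, VOCAB, N_LAYERS = 16, 50, 6
THRESHOLD = 0.9  # hypothetical confidence threshold for exiting early

# Toy stand-ins for transformer blocks and a shared output head.
layers = [rng.normal(scale=0.3, size=(HIDDEN, HIDDEN)) for _ in range(N_LAYERS)]
head = rng.normal(scale=0.3, size=(HIDDEN, VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_forward(h):
    """Run layers one by one; stop as soon as the shared head is confident.

    Returns the predicted token id and the number of layers actually used,
    so "easy" tokens consume less compute than "hard" ones.
    """
    for depth, w in enumerate(layers, start=1):
        h = np.tanh(h @ w)            # one toy transformer block
        probs = softmax(h @ head)     # intermediate prediction via shared head
        if probs.max() >= THRESHOLD:  # confident enough: exit early
            return int(probs.argmax()), depth
    return int(probs.argmax()), N_LAYERS  # fell through: full-depth compute

token, depth_used = early_exit_forward(rng.normal(size=HIDDEN))
print(f"token={token}, layers used={depth_used}/{N_LAYERS}")
```

The per-token exit depth is exactly the kind of adaptive compute allocation the abstract describes: the budget varies with input difficulty instead of being fixed in advance.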