SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop …
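For context, the sync points in question are the per-layer all-reduce operations that Megatron-style tensor parallelism requires to combine partial results across devices. The sketch below illustrates such a sync point in a row-parallel linear layer; the `drop_sync` flag and all names are illustrative assumptions for exposition, not the paper's actual method.

```python
# Minimal sketch of the all-reduce "sync point" in Megatron-style
# tensor parallelism. Run under torch.distributed (e.g. via torchrun).
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    """Each rank holds a row shard of the weight and computes a partial
    output; partial outputs must be summed across ranks (the sync point)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert in_features % world == 0
        # Each rank owns in_features // world input columns of the full W.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features // world) * 0.02)

    def forward(self, x_shard: torch.Tensor, drop_sync: bool = False):
        partial = x_shard @ self.weight.t()  # rank-local partial result
        if not drop_sync:
            # Sync point: one all-reduce per tensor-parallel layer; this
            # communication is the overhead the abstract refers to.
            dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        # With drop_sync=True the all-reduce is skipped, trading exactness
        # of the combined activation for lower communication latency
        # (an SPD-style idea, sketched here purely as an assumption).
        return partial
```

One all-reduce of this kind occurs in every Transformer block under tensor parallelism, so at scale these sync points dominate inter-device traffic, which is why selectively dropping them can reduce inference latency.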