Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose *Mixture of Attention Spans* (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configuration for each head. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer inputs, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9x with the same average sliding-window length, boosting retrieval accuracy by 1.5-7.1x over the uniform-window baseline across Vicuna-{7B, 13B} and Llama3-{8B, 70B} models. Moreover, MoA narrows the performance gap with full attention, reducing the maximum relative performance drop from 9%-36% to within 5% across three long-context understanding benchmarks. MoA achieves a 1.2-1.4x GPU memory reduction and boosts decode throughput by 6.6-8.2x over FlashAttention2 and 1.7-1.9x over vLLM, with minimal performance impact. Our code is available at: https://github.com/thu-nics/MoA
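To make the heterogeneous-window idea concrete, below is a minimal sketch (not the MoA implementation) of sliding-window attention where each head gets its own window length, and the window optionally scales with input length via a hypothetical linear rule `w_h = alpha_h + beta_h * N`; the `alpha`/`beta` parameter names and values are illustrative assumptions, not taken from the paper or the released code.

```python
# Minimal sketch: per-head sliding-window attention masks.
# Assumption: each head's window follows a hypothetical rule w = alpha + beta * seq_len;
# heads with beta = 0 keep a fixed local window, heads with beta > 0 scale with input size.
import torch


def per_head_window_lengths(alphas, betas, seq_len):
    """Elastic rule: each head's window length grows linearly with input length."""
    return [int(a + b * seq_len) for a, b in zip(alphas, betas)]


def sliding_window_masks(window_lengths, seq_len):
    """Build one boolean causal mask per head; True = query may attend to key."""
    idx = torch.arange(seq_len)
    offsets = idx[:, None] - idx[None, :]  # offsets[i, j] = i - j
    masks = []
    for w in window_lengths:
        # causal (j <= i) and within this head's window (i - j < w)
        masks.append((offsets >= 0) & (offsets < w))
    return torch.stack(masks)  # [num_heads, seq_len, seq_len]


def masked_attention(q, k, v, masks):
    """q, k, v: [num_heads, seq_len, head_dim]; masks: [num_heads, seq_len, seq_len]."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~masks, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    num_heads, seq_len, head_dim = 4, 16, 8
    # Heterogeneous configs: two heads keep fixed local windows, two scale with input length.
    alphas, betas = [4, 8, 2, 0], [0.0, 0.0, 0.25, 0.5]
    windows = per_head_window_lengths(alphas, betas, seq_len)  # [4, 8, 6, 8]
    masks = sliding_window_masks(windows, seq_len)
    q = k = v = torch.randn(num_heads, seq_len, head_dim)
    out = masked_attention(q, k, v, masks)
    print(windows, out.shape)  # [4, 8, 6, 8] torch.Size([4, 16, 8])
```

This dense-mask version only illustrates the attention pattern; the memory and throughput gains reported above come from kernels that never materialize the full score matrix.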