Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be applied directly as acceleration plugins to other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism built on the notion of composite tokens: compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient, hardware-friendly execution on GPUs. Across multiple modalities and tasks, ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving $\ge$ 99% of full-attention accuracy and up to 2.61$\times$ faster attention computation than FlashAttention.