Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be applied directly as acceleration plugins to other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism built on the notion of composite tokens: compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient, hardware-friendly execution on GPUs. Across multiple modalities and tasks, ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving $\ge$ 99% of full-attention accuracy and up to 2.61$\times$ faster attention computation than FlashAttention.