Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and integrating seamlessly with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 matches or surpasses the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. Project Page: https://github.com/tensorgi/TPA.
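To make the core idea concrete, below is a minimal PyTorch sketch of a rank-R tensor-product factorization of the attention heads, under our own illustrative assumptions (module name `TensorProductHeads`, ranks, and dimensions are hypothetical, RoPE is omitted): each token's (heads × head_dim) block is built as a sum of outer products of two small contextual factors, so for keys and values only those factors would need to be cached. This is not the authors' reference implementation, only a sketch of the factorization it describes.

```python
# Illustrative sketch (not the official TPA code): per-token multi-head
# projections built as a rank-R sum of outer products of contextual factors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TensorProductHeads(nn.Module):
    """For each token, build the (n_heads x head_dim) slice as
        (1/R) * sum_r a_r (outer) b_r,
    so only the factors a (R x n_heads) and b (R x head_dim) would need to be
    cached for keys/values, instead of the full n_heads x head_dim block."""
    def __init__(self, d_model: int, n_heads: int, head_dim: int, rank: int):
        super().__init__()
        self.n_heads, self.head_dim, self.rank = n_heads, head_dim, rank
        self.to_a = nn.Linear(d_model, rank * n_heads)   # head-axis factors
        self.to_b = nn.Linear(d_model, rank * head_dim)  # feature-axis factors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        a = self.to_a(x).view(B, T, self.rank, self.n_heads)
        b = self.to_b(x).view(B, T, self.rank, self.head_dim)
        # Contract the rank dimension -> (B, T, n_heads, head_dim)
        return torch.einsum("btrh,btrd->bthd", a, b) / self.rank

# Toy usage: with rank-2 key/value factors, the per-token cache shrinks from
# n_heads*head_dim floats to rank*(n_heads + head_dim) floats.
d_model, n_heads, head_dim = 256, 8, 32
q_proj = TensorProductHeads(d_model, n_heads, head_dim, rank=6)
k_proj = TensorProductHeads(d_model, n_heads, head_dim, rank=2)
v_proj = TensorProductHeads(d_model, n_heads, head_dim, rank=2)

x = torch.randn(1, 16, d_model)
q, k, v = (p(x).transpose(1, 2) for p in (q_proj, k_proj, v_proj))  # (B, H, T, D)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```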