Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, which are often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves performance competitive with standard attention and outperforms existing linear attention variants.
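For context, a minimal sketch of the kernelised linear attention the abstract refers to, assuming the common feature-map formulation; the paper's directed and recurrent parameterisations are not reproduced here. Standard causal softmax attention computes, for position $i$,
\[
o_i \;=\; \frac{\sum_{j \le i} \exp\!\big(q_i^\top k_j\big)\, v_j}{\sum_{j \le i} \exp\!\big(q_i^\top k_j\big)},
\]
which costs $O(N^2)$ over a sequence of length $N$. Linear attention replaces $\exp(q_i^\top k_j)$ with a feature-map product $\phi(q_i)^\top \phi(k_j)$, so the sums factorise and can be maintained recurrently,
\[
S_i = S_{i-1} + \phi(k_i)\, v_i^\top, \qquad
z_i = z_{i-1} + \phi(k_i), \qquad
o_i = \frac{S_i^\top \phi(q_i)}{z_i^\top \phi(q_i)},
\]
giving $O(N)$ time and constant memory per step. This symmetric (undirected) kernel structure is the starting point that the proposed directed parameterisation modifies.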