The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance deteriorates significantly when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behavior of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent work, which attributes this sensitivity to the vanishing accumulation of discretization time steps, $\exp(-\sum_{t=1}^N \Delta_t)$, we establish a connection between the state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. To overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models, enabling robust long-context generalization by selectively modulating the spectrum of the $\mathbf{A}$ matrices in each layer. We show that this significantly improves performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.
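The link between the spectrum of $\mathbf{A}$ and state convergence can be illustrated with a minimal diagonal linear SSM recurrence. This is a hedged sketch, not the paper's actual method: it uses a toy zero-order-hold discretization $\bar{a} = \exp(\Delta\, a)$ per channel (Mamba's real parameterization is selective and input-dependent), and the scaling factor `alpha` is a hypothetical illustration of spectrum modulation.

```python
import numpy as np

def run_ssm(a_diag, delta, x):
    """Run a diagonal linear SSM h_t = a_bar * h_{t-1} + x_t over input x.

    a_diag: continuous-time diagonal of A (stable iff entries are negative).
    delta:  discretization step; a_bar = exp(delta * a_diag) has modulus < 1
            exactly when the spectrum of A lies in the open left half-plane.
    """
    a_bar = np.exp(delta * a_diag)
    h = np.zeros_like(a_diag, dtype=float)
    for x_t in x:
        h = a_bar * h + x_t  # each channel decays geometrically at rate |a_bar|
    return h

rng = np.random.default_rng(0)
a = -rng.uniform(0.01, 1.0, size=4)      # negative spectrum -> stable dynamics
x = rng.standard_normal(20_000)

h_short = run_ssm(a, delta=0.1, x=x[:1_000])
h_long = run_ssm(a, delta=0.1, x=x)       # state stays bounded as length grows

# Hypothetical spectrum scaling: multiplying the diagonal of A by alpha > 1
# pushes eigenvalues further into the left half-plane, so the discretized
# modulus |exp(delta * alpha * a)| shrinks and the state forgets faster.
alpha = 2.0
h_scaled = run_ssm(alpha * a, delta=0.1, x=x)
```

Because every $|\bar{a}| < 1$, the recurrence is a contraction in each channel and the state remains bounded for arbitrarily long inputs; channels with $|\bar{a}|$ near 1 are precisely the slowly converging directions that spectrum scaling targets.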