One persistent challenge in LLM research is developing attention mechanisms that generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long-context attention mechanisms to satisfy: scale-invariant total attention and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally, we find that the resulting scale-invariant attention scheme gives considerable improvements in validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.
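To make the idea concrete, the following is a minimal sketch of what a position-dependent transformation of the attention logits might look like. The specific transform used here (rescaling each query's logits by the log of its position relative to the training context length) is an illustrative assumption, not necessarily the exact scheme analysed in the paper.

```python
# Minimal sketch: position-dependent rescaling of attention logits.
# The log-position scaling below is an illustrative assumption, not the
# paper's exact transformation.
import math
import torch
import torch.nn.functional as F

def position_scaled_attention(q, k, v, train_len=2048):
    # q, k, v: [batch, heads, seq_len, head_dim]
    d = q.shape[-1]
    T = q.shape[-2]
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)  # [B, H, T, T]

    # Position-dependent rescaling: the query at position n has its logits
    # multiplied by log(n) / log(train_len), so that the behaviour of total
    # attention and attention sparsity is kept comparable when the context
    # grows beyond the training length.
    pos = torch.arange(1, T + 1, device=q.device).float()
    scale = torch.log(pos.clamp(min=2.0)) / math.log(train_len)
    logits = logits * scale[None, None, :, None]

    # Causal mask: position n only attends to positions <= n.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    logits = logits.masked_fill(~mask, float("-inf"))

    return F.softmax(logits, dim=-1) @ v
```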