The emergence of sparse attention: impact of data distribution and benefits of repetition

Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.
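As a rough illustration of what the power-law claim means in practice, the sketch below fits a relation of the form t = c * x^alpha by ordinary least squares in log-log space. Everything in it is an assumption made for illustration: the helper name `fit_power_law` is hypothetical, sequence length stands in for one possible task-structure variable, and the data are synthetic with an arbitrarily chosen exponent rather than results from the paper.

```python
import math


def fit_power_law(xs, ts):
    """Fit t = c * x**alpha by least squares on (log x, log t).

    Returns (alpha, c). Hypothetical helper, not the authors' code.
    """
    if len(xs) != len(ts) or len(xs) < 2:
        raise ValueError("need at least two (x, t) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_t = [math.log(t) for t in ts]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_t = sum(log_t) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((a - mean_x) * (b - mean_t) for a, b in zip(log_x, log_t))
    alpha = sxy / sxx                         # slope in log-log space
    c = math.exp(mean_t - alpha * mean_x)     # intercept mapped back
    return alpha, c


# Synthetic data only: the exponent 2.0 and prefactor 3.0 are arbitrary
# choices for this sketch, not values reported in the paper.
seq_lengths = [8, 16, 32, 64, 128]
true_alpha, true_c = 2.0, 3.0
emergence_steps = [true_c * length ** true_alpha for length in seq_lengths]

alpha_hat, c_hat = fit_power_law(seq_lengths, emergence_steps)
print(f"estimated exponent: {alpha_hat:.3f}, prefactor: {c_hat:.3f}")
```

With measured emergence times in place of the synthetic ones, a straight line on the log-log plot, with slope equal to the fitted exponent, would be the signature of a power law.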
