Why do modern language models, trained to do well on next-token prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning long-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from that distribution and for any $k$, no algorithm of bounded description length that examines only the next $k$ tokens can distinguish $k$ consecutive tokens of such a document from $k$ tokens generated by the learned model following the same prefix. We provide polynomial bounds (in $k$, independent of document length) on the model size needed to achieve this $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
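One way to read the indistinguishability guarantee more formally is sketched below; the notation is ours rather than the paper's (here $\mathcal{D}$ denotes the training distribution over documents, $x_{<t}$ a prefix of a held-out document $x \sim \mathcal{D}$, $p_\theta$ the learned RNN language model, $\mathcal{A}_s$ the class of distinguishers of description length at most $s$ that read a prefix together with $k$ further tokens, and $\varepsilon$ the distinguishing advantage):
\[
\Bigl|\, \Pr_{x \sim \mathcal{D}}\bigl[A(x_{<t},\, x_{t:t+k}) = 1\bigr] \;-\; \Pr_{\substack{x \sim \mathcal{D},\; y_{1:k} \sim p_\theta(\cdot \mid x_{<t})}}\bigl[A(x_{<t},\, y_{1:k}) = 1\bigr] \,\Bigr| \;\le\; \varepsilon
\qquad \text{for all } A \in \mathcal{A}_s .
\]
On this reading, the size bound in the abstract says that a model satisfying the display can be obtained with size polynomial in $k$ (and in the parameters governing $\mathcal{A}_s$ and $\varepsilon$), independent of the document length.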