We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context downstream tasks, and we evaluate models after SFT as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices such as position extrapolation. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite using only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.