Domain-adaptive pretraining (DAPT) offers a practical path to specializing large language models for high-value domains without full retraining. We conduct an early-stage scaling-law analysis of continued pretraining on U.S. SEC filings, training 1B- and 3B-parameter Llama-3.2 models on a 400M-token financial corpus with validation checkpoints at 50M, 100M, 200M, and 400M tokens. Results show consistent reductions in SEC-domain validation loss for both models, with the largest gains occurring within the first 200M tokens and diminishing returns thereafter. Power-law fits reveal shallow exponents, indicating that financial language is highly regular and efficiently learnable under continued pretraining. General-domain validation loss remains effectively unchanged across all token budgets, suggesting minimal drift and no signs of catastrophic forgetting. A data-efficiency frontier further shows that both models gain domain specialization with negligible mixed-domain degradation. Together, these findings provide early empirical guidance for scaling financial foundation models, suggesting that meaningful domain adaptation can be achieved with comparatively modest token budgets and that larger model scales (7B-70B) remain tractable under projected data requirements.
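To make the power-law framing concrete, the minimal sketch below fits a saturating power law L(D) = L_inf + A * D^(-alpha) to domain validation loss as a function of the continued-pretraining token budget D, evaluated at the four checkpoint budgets named above. The functional form, the SciPy fitting routine, and especially the loss values are illustrative assumptions rather than the paper's actual data or code; the fitted exponent alpha is the quantity referred to here as a "shallow exponent".

```python
# Minimal sketch (not the paper's code): fit a saturating power law
#     L(D) = L_inf + A * D**(-alpha)
# to domain validation losses at the checkpoint token budgets from the abstract.
# All loss values below are hypothetical placeholders, not reported results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(d_millions, l_inf, a, alpha):
    """Saturating power law in the token budget d (in millions of tokens)."""
    return l_inf + a * np.power(d_millions, -alpha)

# Checkpoint budgets from the abstract, in millions of tokens.
tokens_m = np.array([50.0, 100.0, 200.0, 400.0])

# Hypothetical SEC-domain validation losses (placeholders for illustration).
losses = np.array([2.10, 2.02, 1.97, 1.95])

# Fit (L_inf, A, alpha); p0 gives the optimizer a rough starting point.
params, _ = curve_fit(power_law, tokens_m, losses, p0=(1.9, 1.0, 0.5), maxfev=10000)
l_inf, a, alpha = params
print(f"L_inf = {l_inf:.3f}, A = {a:.2f}, alpha = {alpha:.3f}")
# The fitted alpha plays the role of the "shallow exponent" discussed above;
# extrapolating power_law(D) to larger D sketches projected data requirements.
```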