Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
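For readers unfamiliar with the Schedule-Free idea, the following is a minimal sketch of the base Schedule-Free SGD recursion from Defazio et al. [2024] on a scalar toy problem. It is not SF-AdamW and not the refined variant proposed in this paper. The function name `schedule_free_sgd`, the toy quadratic, the hyperparameter values, and the optional `z_history` argument (used only to check the averaging identity) are assumptions made for illustration. The sketch shows that the evaluation point `x` is a running uniform average of the base iterates `z`, maintained in place without storing past iterates.

```python
def schedule_free_sgd(grad, w0, lr=0.1, beta=0.9, steps=100, z_history=None):
    """Toy scalar Schedule-Free SGD (illustrative sketch, constant learning rate).

    z: base SGD iterate
    x: running uniform average of the z iterates (the point one would evaluate)
    y: interpolation between z and x where the gradient is computed
    """
    z = float(w0)
    x = float(w0)
    if z_history is not None:
        z_history.append(z)  # kept only so the averaging identity can be checked
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)              # plain SGD step, gradient taken at y
        c = 1.0 / (t + 1)                 # uniform-averaging weight
        x = (1.0 - c) * x + c * z         # in-place running average of z
        if z_history is not None:
            z_history.append(z)
    return x


def grad_quadratic(w):
    # Gradient of the hypothetical toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


zs = []
x_final = schedule_free_sgd(grad_quadratic, w0=0.0, z_history=zs)
# The two printed numbers agree up to floating-point rounding.
print(x_final, sum(zs) / len(zs))
```

In actual use `z_history` would be left as `None`, so only `z` and `x` are held in memory; no learning rate decay schedule appears anywhere in the loop.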

