As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
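For reference, a sketch of the Schedule-Free update of Defazio et al. [2024] in its plain SGD form (SF-AdamW replaces the gradient step below with Adam's preconditioned, bias-corrected update); here $\gamma$ is the learning rate, $\beta$ the interpolation (momentum) parameter, and $\xi_t$ the sampled batch:
\begin{align*}
y_t &= (1-\beta)\, z_t + \beta\, x_t && \text{(point at which the gradient is evaluated)} \\
z_{t+1} &= z_t - \gamma\, \nabla f(y_t, \xi_t) && \text{(base optimizer step)} \\
x_{t+1} &= (1 - c_{t+1})\, x_t + c_{t+1}\, z_{t+1}, \quad c_{t+1} = \tfrac{1}{t+1} && \text{(online average of the $z$ iterates)}
\end{align*}
The third line keeps $x_t$ as a running uniform average of the $z$ iterates, updated in place, which is the sense in which the averaging is implicit: it is carried by the optimizer's own state rather than by an auxiliary averaging buffer maintained alongside the training iterate.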