Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which tracks training progress more faithfully than the iteration count does. Building on it, we establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
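To make the intrinsic-time and convolutional-functional ideas concrete, the following is a minimal illustrative sketch, not the paper's actual formula: it assumes a hypothetical FSL-style surrogate in which intrinsic time is the cumulative learning rate, a bias term decays as a power law in intrinsic time, and the schedule enters through a convolution of squared learning rates with a power-law kernel. All function names, exponents, and constants here are assumptions for illustration only.

```python
import numpy as np

def intrinsic_time(lrs):
    """Hypothetical 'intrinsic time' clock: cumulative sum of learning rates."""
    return np.cumsum(lrs)

def fsl_surrogate(lrs, L_inf=2.0, A=1.0, B=0.05, alpha=0.5, beta=0.5):
    """Hypothetical FSL-style loss surrogate (illustrative assumption, not the paper's law):
    irreducible loss L_inf, a power-law bias term in intrinsic time, and a noise term
    given by convolving squared learning rates with a power-law kernel."""
    tau = intrinsic_time(lrs)
    bias = A * (1.0 + tau) ** (-alpha)  # approximation/data term decaying in intrinsic time
    # convolutional term: sum_s eta_s^2 * K(tau_t - tau_s), with a power-law kernel K
    noise = np.array([
        B * np.sum(lrs[: t + 1] ** 2 * (1.0 + tau[t] - tau[: t + 1]) ** (-beta))
        for t in range(len(lrs))
    ])
    return L_inf + bias + noise

# Three representative schedules over T steps (toy values).
T, eta0 = 2_000, 1e-3
constant = np.full(T, eta0)
exp_decay = eta0 * 0.998 ** np.arange(T)
stable_steps = int(0.8 * T)  # WSD: stable phase, then linear decay
wsd = np.concatenate([np.full(stable_steps, eta0),
                      np.linspace(eta0, eta0 * 1e-2, T - stable_steps)])

for name, sched in [("constant", constant), ("exp-decay", exp_decay), ("WSD", wsd)]:
    print(f"{name:10s} final surrogate loss: {fsl_surrogate(sched)[-1]:.4f}")
```

Under this toy parameterization, decaying schedules shrink the convolutional noise term near the end of training while the stable phase of WSD keeps intrinsic time accumulating quickly, which mirrors the qualitative comparisons stated in the abstract; the quantitative claims, of course, rest on the paper's actual FSL, not this sketch.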