While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the \textit{effective reuse rate} of the data, $E(K, N)$, defined as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) when $K$ is small, we prove that $E(K, N) \approx K$, so every additional epoch yields a linear gain; (2) as $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ in the strongly convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings reveal a factor neglected in a recent empirical study (Muennighoff et al., 2023), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results show that the largest $K$ for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to model both factors explicitly in future studies of scaling laws with data reuse.
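As a worked restatement of the quantity described above, the effective reuse rate can be viewed as the solution of a loss-matching equation; the notation $\mathcal{L}_{\text{1-pass}}$ and $\mathcal{L}_{K\text{-epoch}}$ for the respective test losses is introduced here only for illustration and is not fixed by the abstract itself:
\begin{equation*}
\mathcal{L}_{\text{1-pass}}\bigl(E(K, N)\cdot N\bigr) \;=\; \mathcal{L}_{K\text{-epoch}}(N),
\end{equation*}
so that $E(K, N)\cdot N$ is the one-pass dataset size whose test loss matches that of $K$ epochs over $N$ samples.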