We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework for rigorously comparing online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. If, however, the two algorithms are run with the same step-size, SGD-M amplifies high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum, and the range of admissible step-sizes for which the iterates converge to such solutions is widened. These examples provide a rigorous account, in line with their empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.
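For concreteness, a standard textbook form of the two update rules compared above is
\[
\theta_{t+1} = \theta_t - \delta\,\nabla_\theta f(\theta_t; x_{t+1}) + \beta\,(\theta_t - \theta_{t-1})
\]
for SGD with Polyak (heavy-ball) momentum, and
\[
\theta_{t+1} = \theta_t - \delta\,\frac{\nabla_\theta f(\theta_t; x_{t+1})}{\|\nabla_\theta f(\theta_t; x_{t+1})\|}
\]
for online SGD with a normalized-gradient step-size. Here $\theta_t$, the step-size $\delta$, the momentum parameter $\beta$, and the per-sample loss $f$ are generic placeholders rather than the notation of the analysis, and the dimension-dependent scaling of $\delta$ under which the scaling limits hold is not reflected in this naive form.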