The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework, leading to state-of-the-art methods such as Muon, Scion, and Gluon that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the $O(1/K^{1/4})$ rate. While several approaches, such as Hessian-Corrected Momentum (HCM), have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting, which hinders their applicability to problems where arbitrary norms are required. In this paper, we extend the LMO-based framework by integrating HCM, and we provide convergence guarantees under relaxed smoothness assumptions and arbitrary norms. We establish an improved convergence rate of $O(1/K^{1/3})$ for HCM, which adapts to the geometry of the problem and achieves a faster rate than traditional momentum. Experimental results on training Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks corroborate our theoretical findings.
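To make the two ingredients concrete, the following is a minimal sketch of one plausible form of a Hessian-corrected momentum update combined with an LMO step over a unit norm ball. It is illustrative only: the function names (`hcm_lmo_step`, `lmo_l2`, `hvp`), the hyperparameters `beta` and `lr`, the flat-array parameterization, and the exact placement of the correction term are assumptions, and the paper's actual update rule and norm choice may differ.

```python
import jax
import jax.numpy as jnp

def hvp(loss, params, batch, v):
    # Hessian-vector product H_k @ v via forward-over-reverse AD,
    # avoiding materialization of the full Hessian.
    grad_fn = lambda p: jax.grad(loss)(p, batch)
    return jax.jvp(grad_fn, (params,), (v,))[1]

def lmo_l2(m):
    # LMO over the unit l2 ball: argmin_{||d||_2 <= 1} <m, d> = -m / ||m||_2.
    # Other norms (e.g., the spectral norm used by Muon) swap in a different LMO.
    return -m / (jnp.linalg.norm(m) + 1e-12)

def hcm_lmo_step(loss, params, prev_params, m_prev, batch, beta=0.1, lr=1e-2):
    # One illustrative step (assumed form, not the paper's exact rule):
    #   m_k     = (1 - beta) * (m_{k-1} + H_k (x_k - x_{k-1})) + beta * g_k
    #   x_{k+1} = x_k + lr * lmo(m_k)
    # The HVP term transports the old momentum along the last step,
    # reducing the bias of plain exponential averaging.
    g = jax.grad(loss)(params, batch)
    correction = hvp(loss, params, batch, params - prev_params)
    m = (1.0 - beta) * (m_prev + correction) + beta * g
    new_params = params + lr * lmo_l2(m)
    return new_params, params, m
```

Replacing `lmo_l2` with an LMO for the spectral norm on weight matrices (the minimizer is $-UV^\top$ from the SVD of the momentum matrix) would recover a Muon-style orthogonalized update, which is how the LMO framework adapts the step direction to a chosen geometry.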