Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum with momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which includes Muon, Scion, and other non-Euclidean LMO-based methods as special cases, and which at the same time relies on a more general smoothness assumption that better reflects the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways. All three improve the convergence rate from ${\cal O} (\frac{1}{K^{1/4}})$ to ${\cal O} (\frac{1}{K^{1/3}})$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.
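For intuition, here is a minimal sketch of how MVR would replace vanilla momentum in a generic LMO-based step; the notation is illustrative (a standard STORM-style estimator and a generic norm-ball LMO), not necessarily the exact update analyzed in the paper:
$$
m^k = \nabla f(x^k;\xi^k) + (1-\beta)\bigl(m^{k-1} - \nabla f(x^{k-1};\xi^k)\bigr),
\qquad
x^{k+1} = x^k + \gamma^k\, \mathrm{lmo}_{\mathcal{B}}(m^k),
$$
where $\mathrm{lmo}_{\mathcal{B}}(m) := \arg\min_{z \in \mathcal{B}} \langle m, z\rangle$ is the linear minimization oracle over the chosen norm ball $\mathcal{B}$. Vanilla momentum would instead use an exponential moving average of the form $m^k = \beta m^{k-1} + (1-\beta)\nabla f(x^k;\xi^k)$; the additional correction term $-(1-\beta)\nabla f(x^{k-1};\xi^k)$ is what reduces the variance of the gradient estimator.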