Muon is Provably Faster with Momentum Variance Reduction


Momentum · Variance · Algorithms · Convergence rate · Optimizers


Xun Qian, Hussein Rammal, Dmitry Kovalev, Peter Richtárik

From arXiv, 31 pages, 4 figures

Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum with momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which captures Muon, Scion, and other non-Euclidean LMO-based methods as special cases and works with a more general smoothness assumption that better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways, all of which improve the convergence rate from $\mathcal{O}\left(\frac{1}{K^{1/4}}\right)$ to $\mathcal{O}\left(\frac{1}{K^{1/3}}\right)$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.
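To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes an LMO over a spectral-norm ball (the Muon-style choice), computed here with an exact SVD, and a STORM-style recursion as one common form of MVR. It may not match any of the three variants analyzed in the paper. The function names `lmo_spectral`, `stoch_grad`, and `run`, the toy objective, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def lmo_spectral(m, radius=1.0):
    """LMO over the spectral-norm ball: argmin of <m, D> subject to ||D||_2 <= radius.

    Uses an exact SVD for clarity (an assumption of this sketch).
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return -radius * (u @ vt)

def stoch_grad(x, noise):
    """Stochastic gradient of the toy objective f(X) = 0.5 * ||X||_F^2 (hypothetical)."""
    return x + noise

def run(use_mvr, steps=200, beta=0.1, lr=0.02, seed=0):
    """Run LMO-based updates with either vanilla momentum or an MVR-style estimator."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((4, 3))
    x_prev, m = x.copy(), None
    for _ in range(steps):
        # One fresh sample per step; MVR evaluates it at both the current and previous point.
        noise = 0.5 * rng.standard_normal(x.shape)
        g = stoch_grad(x, noise)
        if m is None:
            m = g  # initialize the estimator with the first stochastic gradient
        elif use_mvr:
            # STORM-style MVR: new gradient plus a same-sample correction of the old estimate.
            m = g + (1.0 - beta) * (m - stoch_grad(x_prev, noise))
        else:
            # Vanilla momentum: exponential moving average of stochastic gradients.
            m = (1.0 - beta) * m + beta * g
        x_prev = x
        x = x + lr * lmo_spectral(m)  # step along the LMO direction
    return x

x_vanilla = run(use_mvr=False)
x_mvr = run(use_mvr=True)
```

The only difference between the two branches is the correction term `stoch_grad(x_prev, noise)`, which reuses the current sample at the previous iterate. That extra gradient evaluation per step is the price of the variance-reduced estimator in this sketch.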



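The momentum method (Polyak, 1964) is designed to accelerate learning, especially in the presence of high curvature, small but consistent gradients, or noisy gradients. It accumulates an exponentially decaying moving average of past gradients and keeps moving in that direction.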
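The momentum description above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `momentum_descent`, the quadratic test problem, and the step size and decay values are hypothetical and not taken from the source.

```python
import numpy as np

def momentum_descent(grad, x0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with classical (heavy-ball) momentum."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # velocity: decaying accumulation of past gradient steps
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # older gradients are down-weighted by powers of beta
        x = x + v                    # keep moving in the accumulated direction
    return x

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * sum(curv * x**2),
# whose gradient is curv * x and whose minimizer is the origin.
curv = np.array([1.0, 25.0])
x_final = momentum_descent(lambda x: curv * x, x0=[1.0, 1.0])
```

Unrolling the recursion shows that `v` is a sum of past gradient steps weighted by `beta**k`, which is the exponentially decaying average the text refers to.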
