Orthogonal momentum gradient updates have emerged to overcome the limitations of vector-based optimizers such as Adam, which suffer from high memory costs and ill-conditioned momentum updates. However, traditional orthogonalization approaches based on SVD or QR decomposition incur high computational and memory costs and underperform well-tuned SGD with momentum. Recent advances such as Muon improve efficiency by applying momentum before orthogonalization and approximating the orthogonal matrix via Newton-Schulz iterations, yielding better GPU utilization, sustained high TFLOPS, and up to 3x lower memory usage. Nevertheless, vanilla Muon suffers from exploding attention logits and has cubic computational complexity. In this paper, we take a deep dive into orthogonal momentum gradient updates to identify the main properties behind Muon's strong performance. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without approximating orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. AuON includes an automatic "emergency brake" mechanism that handles exploding attention logits. We further introduce a hybrid variant, Hybrid-AuON, which combines the linear transformation with Newton-Schulz iterations and outperforms Muon on language modeling tasks. Code is available at: https://github.com/ryyzn9/AuON
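For context, below is a minimal sketch of the quintic Newton-Schulz orthogonalization that Muon applies to the momentum matrix. The helper name `newton_schulz_orthogonalize` and the polynomial coefficients are assumptions drawn from Muon's public reference implementation, not from this paper's method.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D matrix G to the nearest (semi-)orthogonal matrix.

    Quintic Newton-Schulz iteration in the style of Muon's reference
    implementation; the coefficients (a, b, c) are taken from that public
    code and should be treated as an assumption here.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    # Normalize by the Frobenius norm (an upper bound on the spectral norm)
    # so the iteration's convergence condition holds.
    X = X / (X.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # work in the wide orientation so A = X X^T is the small Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.T
    return X
```

Each iteration costs matrix-matrix products (cubic in the smaller dimension), which is the cubic complexity the abstract refers to and which AuON's linear-time update avoids.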