HVAdam: A Full-Dimension Adaptive Optimizer

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, on classical architectures like CNNs they often generalize worse than non-adaptive methods such as SGD. We identify a key cause of this performance gap: the adaptivity built into their pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We establish theoretical convergence guarantees in both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and that Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers while surpassing their respective strengths.
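
To make "continuously tunable adaptivity" concrete, the following is a minimal illustrative sketch and not Anon's actual update rule, which the abstract does not specify. It assumes a hypothetical exponent `gamma` on the second-moment estimate, so that `gamma = 0` reduces to SGD with momentum, `gamma = 0.5` gives an Adam-like step, and larger values extrapolate beyond Adam. The `amsgrad` flag shows the hard max-tracking strategy that the abstract contrasts with IDU; IDU itself is not sketched because its definition is not given here. The names `new_state`, `tunable_step`, and `gamma` are assumptions introduced for illustration.

```python
def new_state():
    """Fresh optimizer state for a single scalar parameter (hypothetical helper)."""
    return {"t": 0, "m": 0.0, "v": 0.0, "v_max": 0.0}


def tunable_step(param, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
                 gamma=0.5, eps=1e-8, amsgrad=False):
    """One scalar update with an assumed adaptivity exponent `gamma`.

    gamma = 0   -> denominator is about 1: SGD with bias-corrected momentum
    gamma = 0.5 -> denominator is sqrt(v_hat): Adam-like step
    gamma > 0.5 -> stronger normalization than Adam (extrapolation)
    """
    state["t"] += 1
    t = state["t"]

    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad

    # Bias correction, as in Adam.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    if amsgrad:
        # AMSGrad-style hard max tracking: the estimate never decreases.
        state["v_max"] = max(state["v_max"], v_hat)
        v_hat = state["v_max"]

    # The exponent controls how strongly the step is normalized.
    denom = v_hat ** gamma + eps
    return param - lr * m_hat / denom


if __name__ == "__main__":
    for g in (0.0, 0.5, 1.0):
        x = tunable_step(1.0, 2.0, new_state(), gamma=g)
        print(f"gamma={g}: first step moves the parameter from 1.0 to {x:.4f}")
```

With a single gradient of 2.0, the first step has size `lr * 2` for `gamma = 0`, `lr` for `gamma = 0.5`, and `lr / 2` for `gamma = 1`, which shows how one scalar can shift the update from gradient-proportional to increasingly normalized.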
