Adaptive optimizers such as Adam have achieved great success in training large-scale models such as large language models and diffusion models. However, they often generalize worse than non-adaptive methods such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: the restricted (fixed) adaptivity of the pre-conditioner, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce the incremental delay update (IDU), a mechanism that is more flexible than AMSGrad's hard max-tracking strategy and improves robustness to gradient noise. We establish theoretical convergence guarantees in both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable, tunable design principle, and that Anon provides the first unified and reliable framework for bridging the gap between classical and modern optimizers while going beyond the strengths of both.
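The abstract does not spell out Anon's update rule or the IDU mechanism. The sketch below is only a minimal illustration, under our own assumptions, of the general idea of continuously tunable adaptivity: a single exponent p on the second-moment pre-conditioner lets one update rule behave like SGD with momentum (p = 0), like Adam (p = 0.5), or something in between or beyond. The function name, the exponent parameterization, and all hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def tunable_adaptivity_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                            p=0.5, eps=1e-8):
    """One step of a hypothetical optimizer with a tunable adaptivity exponent p.

    p = 0   -> pre-conditioner is ~1: SGD-like update with momentum.
    p = 0.5 -> pre-conditioner is sqrt(v): Adam-like update.
    Other p values interpolate between, or extrapolate beyond, the two.
    (Illustrative only; this is not the Anon algorithm or its IDU mechanism.)
    """
    m = state.setdefault("m", np.zeros_like(param))  # first moment estimate
    v = state.setdefault("v", np.zeros_like(param))  # second moment estimate
    t = state["t"] = state.get("t", 0) + 1

    m[:] = beta1 * m + (1 - beta1) * grad
    v[:] = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction, as in Adam.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # The exponent p controls how strongly each coordinate is rescaled.
    return param - lr * m_hat / (v_hat ** p + eps)
```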