Choosing an appropriate learning rate remains a key challenge in scaling the depth of modern deep networks. The classical maximal update parameterization ($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures, where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P (AM-$\mu$P), which constrains not each individual layer but the network-wide arithmetic mean of the one-step pre-activation second moments to a constant scale. Combined with a residual-aware He fan-in initialization that scales the variance of residual-branch weights inversely with the number of blocks $K$, i.e., $\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$, AM-$\mu$P yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects remain at a constant level when $N\gg k$. For standard residual networks with general convolutional and MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, where $L$ denotes the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical learning-rate principle for convolutional and deep residual networks without additional tuning overhead.
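To make the two quantitative ingredients above concrete, the following minimal sketch (in PyTorch) illustrates the residual-aware He fan-in initialization $\mathrm{Var}[W]=c/(K\cdot\mathrm{fan\text{-}in})$ and the $\eta^\star(L)\propto L^{-3/2}$ zero-shot learning-rate transfer rule. The function names, the choice $c=2$, and the base learning rate and base depth are illustrative assumptions, not values prescribed here.

```python
import math
import torch
import torch.nn as nn

def residual_aware_he_init_(weight: torch.Tensor, num_blocks: int, c: float = 2.0) -> None:
    # He fan-in initialization with a residual-aware 1/K correction:
    # Var[W] = c / (K * fan_in). For a Conv2d weight of shape (out, in, kh, kw),
    # fan_in = in * kh * kw; for a Linear weight of shape (out, in), fan_in = in.
    fan_in = weight[0].numel()
    std = math.sqrt(c / (num_blocks * fan_in))
    with torch.no_grad():
        weight.normal_(mean=0.0, std=std)

def depth_scaled_lr(base_lr: float, base_depth: int, depth: int) -> float:
    # Zero-shot learning-rate transfer under the eta*(L) ∝ L^{-3/2} depth law:
    # tune base_lr once at base_depth, then rescale for any target depth.
    return base_lr * (depth / base_depth) ** (-1.5)

# Illustrative usage: initialize the residual-branch conv of each of K blocks,
# then set the learning rate from a base configuration tuned at a shallower depth.
K = 16                                                      # number of residual blocks (assumed)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
residual_aware_he_init_(conv.weight, num_blocks=K)
lr = depth_scaled_lr(base_lr=0.1, base_depth=8, depth=K)    # 0.1 * 2**(-1.5)
```

This sketch is a schematic instance of the stated scaling rules, not necessarily the exact implementation used in the experiments.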