Near an optimal learning point of a neural network, the learning performance of gradient-descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for several classes of teacher-student problems in which the teacher and student networks have matching weights, and show that the smaller eigenvalues of the Hessian determine the long-time learning performance. For linear networks, we establish analytically that, in the large-network limit, the spectrum asymptotically follows the convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We analyze the Hessian spectrum numerically for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be interpreted as an effective number of parameters for networks with polynomial activation functions. For a generic non-linear activation function, such as the error function, we observe empirically that the Hessian matrix is always full rank.
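For concreteness, the convolution referenced above can be written schematically as follows; the degrees of freedom k, scale s, variance \sigma^2, and aspect ratio q are illustrative placeholders, not values fixed by the abstract, which the full analysis would determine from the network dimensions.

\[
\rho_H(\lambda) \;=\; \bigl(\rho_{\chi^2} * \rho_{\mathrm{MP}}\bigr)(\lambda)
\;=\; \int \rho_{\chi^2}(\lambda-\mu)\,\rho_{\mathrm{MP}}(\mu)\,\mathrm{d}\mu ,
\]
with the scaled chi-square and Marchenko-Pastur densities
\[
\rho_{\chi^2}(x) \;=\; \frac{x^{k/2-1}\,e^{-x/(2s)}}{(2s)^{k/2}\,\Gamma(k/2)},
\qquad
\rho_{\mathrm{MP}}(x) \;=\; \frac{\sqrt{(x_+ - x)(x - x_-)}}{2\pi\,\sigma^2 q\,x}\,\mathbf{1}_{[x_-,\,x_+]}(x),
\qquad
x_{\pm} \;=\; \sigma^2\bigl(1 \pm \sqrt{q}\bigr)^2 .
\]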