Parameter Averaging for SGD Stabilizes the Implicit Bias towards Flat Regions

Stochastic gradient descent is a workhorse for training deep neural networks due to its excellent generalization performance. Several studies have attributed this success to the implicit bias of the method, which prefers flat minima, and have developed new methods based on this perspective. Recently, Izmailov et al. (2018) empirically observed that averaged stochastic gradient descent with a large step size can bring out the implicit bias more effectively and can converge more stably to a flat minimum than vanilla stochastic gradient descent. In this work, we theoretically justify this observation by showing that the averaging scheme improves the bias-optimization tradeoff arising from stochastic gradient noise: a large step size amplifies the bias but makes convergence unstable, whereas a small step size stabilizes convergence but weakens the bias. Specifically, we show that, under certain conditions, averaged stochastic gradient descent can get closer to a solution of a sharpness-penalized objective than vanilla stochastic gradient descent using the same step size. In experiments, we verify our theory and show that this learning scheme significantly improves performance.
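
The stability half of this tradeoff can be illustrated with a toy sketch. It is not the paper's analysis or experimental setup: it assumes a one-dimensional quadratic objective, Gaussian gradient noise, a constant step size, and uniform averaging of all iterates, and the names `sgd_with_averaging` and `quadratic_grad` are hypothetical. It only shows that, for the same large step size, the averaged iterate fluctuates far less around the minimum than the last iterate; it does not model the bias towards flat regions.

```python
import numpy as np


def quadratic_grad(w, curvature=1.0):
    """Gradient of the toy objective f(w) = 0.5 * curvature * w**2."""
    return curvature * w


def sgd_with_averaging(grad_fn, w0, step_size, n_steps, rng, noise_std=1.0):
    """Constant-step-size SGD with additive Gaussian gradient noise.

    Returns the last iterate and the uniform running average of all iterates.
    """
    w = np.array(w0, dtype=float)
    w_avg = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        # Stochastic gradient: true gradient plus assumed Gaussian noise.
        noisy_grad = grad_fn(w) + noise_std * rng.standard_normal(w.shape)
        w = w - step_size * noisy_grad
        # Incremental update of the mean of w_1, ..., w_t.
        w_avg += (w - w_avg) / t
    return w, w_avg


rng = np.random.default_rng(0)
w0 = np.ones(1000)  # 1000 independent one-dimensional runs, all started at w = 1
w_last, w_avg = sgd_with_averaging(
    quadratic_grad, w0, step_size=0.5, n_steps=2000, rng=rng
)

# Mean squared distance to the minimizer w = 0, averaged over the runs.
mse_last = float(np.mean(w_last ** 2))
mse_avg = float(np.mean(w_avg ** 2))
print(f"last iterate MSE: {mse_last:.4f}, averaged iterate MSE: {mse_avg:.6f}")
```

In this toy setting the last iterate keeps a noise floor set by the step size, while the error of the averaged iterate shrinks as more steps are averaged. This is the sense in which averaging permits a large step size without giving up stable convergence.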
