Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections, but have seen limited success in deep learning. Here, we reveal surprising new foundational connections between SVRG and a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: first, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.
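For context on the gradient corrections mentioned above, the following is a minimal sketch of the classic SVRG update (in the standard Johnson-and-Zhang form), not the Bayesian posterior-correction variants derived in this work; the function names and arguments (`grad_i`, `svrg`, etc.) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def svrg(grad_i, w0, n, lr=0.1, epochs=10, inner_steps=None, rng=None):
    """Minimal classic-SVRG sketch.

    grad_i(w, i): gradient of the i-th loss term at weights w.
    w0: initial weights (numpy array); n: number of loss terms.
    """
    rng = np.random.default_rng() if rng is None else rng
    inner_steps = n if inner_steps is None else inner_steps
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()  # snapshot of the weights
        # full gradient at the snapshot, used as the correction anchor
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)  # sample one loss term
            # variance-reduced ("corrected") stochastic gradient
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= lr * g
    return w
```

The correction term `- grad_i(w, i=snapshot) + full_grad` is what the paper generalizes: under an isotropic-Gaussian posterior family it reduces to this gradient correction, while richer exponential families yield the Hessian-corrected and Adam-like variants described above.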