Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.