In our prior work, LayerPipe, we introduced an approach to accelerate the training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. Despite its empirical success, however, a principled understanding of how much gradient delay must be introduced at each layer to achieve a desired level of pipelining was missing. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure: inner layers require fewer delays, while outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages; when layers are pipelined in groups, all layers in a group share the same delay assignment. These insights not only explain previously observed scheduling patterns but also expose an often-overlooked challenge: pipelining implicitly requires the storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly, reducing memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that shows how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.
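To make the delay-assignment rule concrete, the following is a minimal sketch in Python. It assumes (these conventions are not spelled out in the text above) that layers are indexed from input to output, that "downstream" means closer to the output, and that when layers are grouped into pipeline stages every layer inherits its group's delay.

```python
# Sketch of the delay-assignment rule: each layer's delay equals the number
# of pipeline stages downstream of it. Indexing conventions are assumptions.

def per_layer_delays(num_layers: int) -> list[int]:
    """Per-layer pipelining: delay = number of stages downstream of the layer."""
    return [num_layers - 1 - layer for layer in range(num_layers)]

def grouped_delays(group_sizes: list[int]) -> list[int]:
    """Grouped pipelining: all layers in a group share one delay,
    equal to the number of downstream groups."""
    delays = []
    num_groups = len(group_sizes)
    for g, size in enumerate(group_sizes):
        delays.extend([num_groups - 1 - g] * size)
    return delays

if __name__ == "__main__":
    print(per_layer_delays(5))        # [4, 3, 2, 1, 0]
    print(grouped_delays([2, 2, 1]))  # [2, 2, 1, 1, 0]
```

Under these assumptions, the output layer needs no delay, the input layer needs the longest delay, and grouping simply flattens the assignment within each stage.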
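The pipeline-aware moving average is only summarized above; its exact reconstruction rule is not given here. The sketch below is therefore a hypothetical illustration of the idea: keep one running average per layer instead of a buffer of delayed weight copies, and read the average wherever pipelined backpropagation would otherwise fetch a stale weight. The class name, hyperparameters, and update rule are illustrative assumptions.

```python
import numpy as np

class MovingAverageWeight:
    """Hypothetical sketch: a running average stands in for stored past weights."""

    def __init__(self, w0: np.ndarray, beta: float = 0.9):
        self.w = w0.copy()    # current weights
        self.avg = w0.copy()  # running average reconstructing historical state
        self.beta = beta      # averaging factor (assumed, not from the paper)

    def step(self, grad: np.ndarray, lr: float = 0.01):
        # Usual (possibly delayed-gradient) update of the live weights.
        self.w -= lr * grad
        # Update the average instead of appending a new weight snapshot.
        self.avg = self.beta * self.avg + (1.0 - self.beta) * self.w

    def delayed_view(self) -> np.ndarray:
        # Used where the pipeline would otherwise need w[t - d]:
        # return the reconstructed state; no per-step copies are kept.
        return self.avg
```

The point of the sketch is the memory shape: storage stays O(1) per layer regardless of the delay depth, which is the bottleneck the pipeline-aware moving average is meant to remove.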