Optimizing Frequent Checkpointing via Low-Cost Differential for Distributed Training Systems

Distributed training of large deep-learning models is frequently interrupted by failures, so checkpointing is commonly employed for recovery. State-of-the-art studies focus on frequent checkpointing to enable fast recovery from failures. However, frequent checkpointing generates numerous checkpoints, incurring substantial costs and thus degrading training performance. Recently, differential checkpointing has been proposed to reduce these costs, but it is limited to recommendation systems, so its application to general distributed training systems remains unexplored. We propose \sysname, an efficient frequent checkpointing framework that \textit{reuses} compressed gradients as differential checkpoints to reduce cost. Furthermore, \sysname incorporates a batched gradient write optimization to persist these differentials to storage efficiently. It also dynamically tunes both the checkpoint frequency and the batching size to maximize performance. For the non-compression scenario, we further propose \sysnameplus, which combines a layer-wise gradient reusing and snapshotting approach with a CPU-based asynchronous persistence strategy, enabling frequent checkpointing without gradient compression. Experiments on various workloads show that \sysname can checkpoint as often as every iteration with less than 3.1\% runtime overhead.
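To make the idea of reusing compressed gradients as differential checkpoints concrete, the following is a minimal sketch and not the \sysname implementation. It assumes plain SGD, uses top-k sparsification as a stand-in compressor, and keeps "storage" as an in-memory list of pickled blobs. All names (`DiffCheckpointer`, `top_k_sparsify`, `toy_gradient`, `train`, `recover`) and the fixed batch size are hypothetical choices made for illustration; the dynamic tuning of frequency and batch size described above is not modeled.

```python
import pickle


def top_k_sparsify(grad, k):
    """Stand-in compressor (assumption): keep the k largest-magnitude entries."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(idx)]


def apply_update(weights, sparse_grad, lr):
    """Plain SGD step using only the retained gradient entries."""
    for i, g in sparse_grad:
        weights[i] -= lr * g


class DiffCheckpointer:
    """Hypothetical helper: one full checkpoint plus batched differentials."""

    def __init__(self, base_weights, batch_size):
        # storage[0] is the full checkpoint; later entries are batches of diffs.
        self.storage = [pickle.dumps(("full", list(base_weights)))]
        self.batch_size = batch_size
        self.buffer = []

    def record(self, sparse_grad):
        # Reuse the already-compressed gradient as this iteration's differential.
        self.buffer.append(sparse_grad)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Batched write: persist several differentials in a single blob.
        if self.buffer:
            self.storage.append(pickle.dumps(("diffs", self.buffer)))
            self.buffer = []


def recover(storage, lr):
    """Rebuild weights by replaying persisted differentials on the full checkpoint."""
    _, weights = pickle.loads(storage[0])
    for blob in storage[1:]:
        _, diffs = pickle.loads(blob)
        for sparse_grad in diffs:
            apply_update(weights, sparse_grad, lr)
    return weights


def toy_gradient(weights, step):
    """Hypothetical deterministic gradient used only to drive the example."""
    return [w - ((step + i) % 3) for i, w in enumerate(weights)]


def train(steps, lr=0.1, k=2, batch_size=4):
    weights = [0.5, -1.0, 2.0, 0.0, 1.5]
    ckpt = DiffCheckpointer(weights, batch_size)
    for step in range(steps):
        sparse = top_k_sparsify(toy_gradient(weights, step), k)
        apply_update(weights, sparse, lr)
        ckpt.record(sparse)
    return weights, ckpt
```

In this sketch, replaying the persisted differentials reproduces the model state as of the last flushed batch, so a failure loses at most `batch_size - 1` iterations. With stateful optimizers such as Adam, the optimizer state would also have to be reconstructed, which the sketch leaves out.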
