As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement for achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 2.2-2.4x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.
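To make the distinction between the two scaling regimes concrete, the following is a minimal illustrative sketch (not from the paper; the batch-size values are hypothetical assumptions) contrasting weak scaling, where the global batch grows with the cluster, against the strong-scaling regime the abstract describes, where the global batch is held fixed and per-GPU batches shrink:

```python
# Illustrative sketch only; GLOBAL_BATCH_LIMIT and the per-GPU batch of 64
# are hypothetical values, not numbers reported by the paper.
GLOBAL_BATCH_LIMIT = 4096  # assumed point beyond which sample efficiency degrades

def per_gpu_batch(num_gpus: int, per_gpu_batch_weak: int = 64) -> tuple[int, int]:
    """Return (weak-scaling per-GPU batch, strong-scaling per-GPU batch)."""
    # Weak scaling: fix the per-GPU batch, so the global batch grows with the cluster.
    weak = per_gpu_batch_weak
    # Strong scaling: hold the global batch at the limit and shrink per-GPU batches.
    strong = max(1, GLOBAL_BATCH_LIMIT // num_gpus)
    return weak, strong

for n in (64, 256, 1024):
    weak, strong = per_gpu_batch(n)
    print(f"{n:5d} GPUs: weak-scaling global batch = {n * weak:6d}, "
          f"strong-scaling per-GPU batch = {strong}")
```

As the cluster grows, the strong-scaling per-GPU batch becomes very small, which is what makes it hard to keep each GPU efficiently utilized and motivates DeepPool's burst parallelism and GPU multiplexing.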