Recently, distributed GNN training frameworks such as DistDGL and PyG have been developed to train GNN models on large graphs by leveraging multiple GPUs in a distributed manner. Despite these advances, their memory requirements remain excessively high, hindering GNN training on large graphs with commodity workstations. In this paper, we propose SDT-GNN, a streaming-based distributed GNN training framework. Unlike existing frameworks that load the entire graph into memory, SDT-GNN takes a stream of edges as input for graph partitioning, reducing the memory required for partitioning. It also enables distributed GNN training even when the aggregate GPU memory is smaller than the combined size of the graph and feature data. Furthermore, to improve partitioning quality, we propose SPRING, a novel streaming partitioning algorithm for distributed GNN training. We demonstrate the effectiveness and efficiency of SDT-GNN on seven large public datasets. SDT-GNN reduces the memory footprint by up to 95% compared with DistDGL and PyG without sacrificing prediction accuracy, and SPRING significantly outperforms state-of-the-art streaming partitioning algorithms.
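To make the idea of streaming-based partitioning concrete, the following is a minimal illustrative sketch of a greedy streaming edge partitioner, not the SPRING algorithm itself. It assumes edges arrive one at a time, so only per-partition vertex sets and load counters are kept in memory rather than the full graph; the function name and scoring rule are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (assumption: a generic greedy vertex-cut heuristic,
# not SDT-GNN's actual SPRING algorithm). Edges are consumed as a stream,
# so memory usage depends on the vertex/partition state, not the edge count.

from collections import defaultdict

def stream_partition(edge_stream, num_parts):
    """Assign each edge to a partition as it arrives.

    edge_stream: iterable of (u, v) pairs.
    num_parts: number of partitions (e.g., number of workers or GPUs).
    Returns a dict mapping partition id -> list of edges.
    """
    part_vertices = [set() for _ in range(num_parts)]  # vertices seen per partition
    part_load = [0] * num_parts                        # edges assigned per partition
    parts = defaultdict(list)

    for u, v in edge_stream:
        # Prefer partitions that already hold u or v (fewer vertex replicas),
        # breaking ties by current load to keep partitions balanced.
        def score(p):
            overlap = (u in part_vertices[p]) + (v in part_vertices[p])
            return (overlap, -part_load[p])

        best = max(range(num_parts), key=score)
        parts[best].append((u, v))
        part_vertices[best].update((u, v))
        part_load[best] += 1

    return parts

# Example: partition a small edge stream across 2 workers.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5)]
print(stream_partition(edges, num_parts=2))
```

Because each edge is processed once and then discarded, such a partitioner never materializes the whole graph, which is the property the streaming design relies on to lower the memory footprint of partitioning.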