This paper presents a detailed comparative analysis of three major Python data manipulation libraries (Pandas, Polars, and Dask) when they are embedded in complete deep learning (DL) training and inference pipelines. The research addresses a gap in the existing literature by studying how these libraries interact with substantial GPU workloads during critical phases such as data loading, preprocessing, and batch feeding. The authors measured runtime, memory usage, disk usage, and energy consumption (both CPU and GPU) across various machine learning models and datasets.
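To make the kind of measurement described above concrete, the following is a minimal sketch of timing one pipeline stage and recording its peak memory. It is not the authors' benchmarking harness: the `measure` helper and the `toy_preprocess` stage are hypothetical names introduced here for illustration, and only the Python standard library is used. Note that `tracemalloc` tracks allocations made through Python's allocator, so it may miss native buffers held by a library as well as GPU memory; disk and energy measurements would need separate tooling.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict


def measure(step: Callable[[], Any]) -> Dict[str, Any]:
    """Run `step` once; report wall-clock time and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = step()
    elapsed = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}


def toy_preprocess() -> list:
    # Hypothetical stand-in for a load-and-preprocess stage; a real
    # comparison would call the Pandas, Polars, or Dask code path here.
    rows = [{"x": i, "y": i * 2} for i in range(10_000)]
    return [r["x"] + r["y"] for r in rows]


stats = measure(toy_preprocess)
print(f"{stats['seconds']:.4f} s, peak {stats['peak_bytes'] / 1e6:.2f} MB")
```

Wrapping each stage (loading, preprocessing, batch feeding) in the same helper would give per-stage figures that can be compared across libraries, provided every run uses the same data and hardware.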