A research paper titled "A Study on the Impact of Data Augmentation for Training Convolutional Neural Networks in the Presence of Noisy Labels"

Label noise is common in large real-world datasets, and it harms the training of deep neural networks. Although several works have proposed training strategies to address this problem, few studies evaluate the impact of data augmentation as a design choice for training deep neural networks. In this work, we analyse model robustness under different data augmentations and how much they improve training in the presence of noisy labels. We evaluate state-of-the-art and classical data augmentation strategies with different levels of synthetic noise on the datasets MNIST, CIFAR-10, and CIFAR-100, and on the real-world dataset Clothing1M. We evaluate the methods using the accuracy metric. Results show that an appropriate selection of data augmentation can drastically improve model robustness to label noise, with a relative improvement of up to 177.84% in best test accuracy compared to the baseline with no augmentation, and an increase of up to 6% in absolute terms with the state-of-the-art DivideMix training strategy.
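
To make the two quantities in the abstract concrete, the following is a minimal sketch of how synthetic label noise might be injected and how a relative accuracy improvement is computed. The abstract does not specify the noise model, so symmetric (uniform) flipping to a different class is an assumption here, and the function names `add_symmetric_label_noise` and `relative_improvement` are hypothetical, not taken from the paper.

```python
import numpy as np


def add_symmetric_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fraction `noise_rate` of labels to a different, uniformly chosen class.

    Assumption: symmetric noise that always changes the class; the paper's
    exact noise protocol is not given in the abstract.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    # Pick which samples get a corrupted label, without repetition.
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1], taken modulo num_classes,
    # guarantees the new label differs from the original one.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy


def relative_improvement(baseline_acc, new_acc):
    """Relative gain over the baseline, in percent."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc
```

For example, with made-up numbers, `relative_improvement(20.0, 50.0)` gives 150.0: an absolute gain of 30 points is a 150% relative gain over a baseline of 20. This is why a relative figure such as 177.84% can coexist with much smaller absolute gains when the baseline accuracy is low.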
