Methods for detecting label errors in training data require models that are robust to those errors, i.e., models that do not fit erroneously labelled data points. Acquiring such models is difficult, however, because they must typically be trained on the corrupted data itself. One avenue for improvement is to adjust the loss function. Motivated by Focal Loss, which up-weights difficult-to-classify samples, two novel yet simple loss functions are proposed that instead de-weight or ignore such difficult samples, on the grounds that they are the ones most likely to carry label errors. Results on artificially corrupted data are promising: F1 scores for error detection improve over the baselines of conventional categorical Cross Entropy and Focal Loss.
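The abstract does not state the exact form of the two proposed losses, but the de-weight/ignore idea can be illustrated by inverting Focal Loss's modulating factor. The sketch below is one plausible PyTorch realization under that assumption, not the paper's actual method; the function names deweighted_ce and thresholded_ce and the parameters gamma and tau are illustrative choices.

```python
import torch
import torch.nn.functional as F

def deweighted_ce(logits: torch.Tensor, targets: torch.Tensor,
                  gamma: float = 2.0) -> torch.Tensor:
    """Hypothetical 'de-weighting' loss: scale per-sample cross entropy by
    p_t**gamma (the inverse of Focal Loss's (1 - p_t)**gamma), so samples the
    model finds hard (and which may be mislabelled) contribute less."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t per sample
    p_t = torch.exp(-ce)                                     # true-class probability
    # detach() keeps the weight from back-propagating gradients of its own
    return (p_t.detach() ** gamma * ce).mean()

def thresholded_ce(logits: torch.Tensor, targets: torch.Tensor,
                   tau: float = 0.2) -> torch.Tensor:
    """Hypothetical 'ignoring' loss: drop samples whose true-class probability
    falls below tau, excluding likely label errors from training entirely."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)
    mask = (p_t.detach() >= tau).float()
    # clamp avoids division by zero if every sample in the batch is masked out
    return (mask * ce).sum() / mask.sum().clamp(min=1.0)
```

Either function can be swapped in wherever F.cross_entropy is called during training; a model trained this way should fit the erroneous points less closely, making its disagreements with the given labels a usable error-detection signal.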