An Empirical Study on the Effectiveness of Data Resampling Approaches for Cross-Project Software Defect Prediction

Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect prediction datasets is class imbalance, that is, highly skewed datasets in which non-buggy modules far outnumber buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE, Borderline-SMOTE, Random Oversampling, and ADASYN) and three undersampling approaches (Random Undersampling, Tomek Links, and One-Sided Selection) is investigated, and the results are compared to approaches without data resampling. The authors examined six defect prediction models on 34 datasets extracted from the PROMISE repository. The authors' results show that there is a significant positive effect of data resampling on CPDP performance, suggesting that software quality teams and researchers should consider applying data resampling approaches to improve recall (pd) and g-measure prediction performance. However, if the goal is to improve precision and reduce false alarms (pf), then data resampling approaches should be avoided.
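To make the terms in the abstract concrete, the following is a minimal Python sketch of Random Oversampling (one of the five oversampling approaches named above) and of the pd, pf, and g-measure metrics. It is an illustration rather than the authors' implementation: the function names `random_oversample` and `pd_pf_g` are hypothetical, labels are assumed to be 1 for buggy and 0 for non-buggy modules, and the g-measure is assumed to be the harmonic mean of pd and (1 - pf), a definition common in the defect prediction literature but not stated in the abstract.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen buggy rows (label 1) until both classes are the same size.

    Hypothetical helper for illustration; assumes buggy modules are the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Nothing to do if there are no buggy rows or the data is already balanced.
    if not minority or len(minority) >= len(majority):
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    new_rows = list(rows) + [rows[i] for i in extra]
    new_labels = list(labels) + [1] * len(extra)
    return new_rows, new_labels


def pd_pf_g(actual, predicted):
    """Return (pd, pf, g-measure) for binary labels, 1 = buggy.

    pd is recall, pf is the false alarm rate, and the g-measure is assumed
    here to be the harmonic mean of pd and (1 - pf).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    denom = pd + (1 - pf)
    g = 2 * pd * (1 - pf) / denom if denom else 0.0
    return pd, pf, g
```

In a CPDP setting, resampling of this kind would be applied only to the training data selected from other projects, with the target project's data left untouched for evaluation. The trade-off reported in the abstract is visible in the metric definitions: adding buggy training examples tends to raise pd, but more modules predicted as buggy can also raise pf.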
