Top-N item recommendation from implicit feedback has been a widely studied task. Although much progress has been made with neural methods, there is increasing concern about the appropriate evaluation of recommendation algorithms. In this paper, we revisit alternative experimental settings for evaluating top-N recommendation algorithms, considering three important factors, namely dataset splitting, sampled metrics, and domain selection. We select eight representative recommendation algorithms (covering both traditional and neural methods) and conduct extensive experiments on a very large dataset. By carefully revisiting the different options, we make several important findings on the three factors, which directly yield useful suggestions on how to appropriately set up experiments for top-N item recommendation.
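As a point of reference for the sampled-metrics factor mentioned above, the following is a minimal sketch (not taken from the paper, using synthetic NumPy scores and hypothetical variable names) contrasting full-ranking evaluation, where each held-out item is ranked against all items, with a sampled metric, where it is ranked against only a small set of sampled negatives.

```python
# Minimal sketch (illustrative only): Recall@K under full ranking vs. sampled negatives.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K, n_neg = 1000, 5000, 10, 99

# Hypothetical predicted scores and one held-out ground-truth item per user.
scores = rng.normal(size=(n_users, n_items))
truth = rng.integers(0, n_items, size=n_users)

def recall_at_k_full(scores, truth, k):
    """Rank every item for each user (full ranking)."""
    top_k = np.argpartition(-scores, k, axis=1)[:, :k]
    return np.mean([truth[u] in top_k[u] for u in range(len(truth))])

def recall_at_k_sampled(scores, truth, k, n_neg, rng):
    """Rank the ground-truth item against n_neg sampled negatives only."""
    hits = 0
    for u in range(len(truth)):
        negs = rng.choice(np.delete(np.arange(scores.shape[1]), truth[u]),
                          size=n_neg, replace=False)
        candidates = np.concatenate(([truth[u]], negs))
        ranked = candidates[np.argsort(-scores[u, candidates])]
        hits += truth[u] in ranked[:k]
    return hits / len(truth)

print("full ranking  Recall@10:", recall_at_k_full(scores, truth, K))
print("sampled (99)  Recall@10:", recall_at_k_sampled(scores, truth, K, n_neg, rng))
```

The two numbers can differ substantially because the sampled candidate set is much easier to rank within, which is one reason the choice of evaluation protocol matters when comparing algorithms.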