Sentiments about the reproducibility of cited papers, as expressed in downstream literature, offer community perspectives and have shown promise as a signal of the actual reproducibility of published findings. To train models that predict reproducibility-oriented sentiments and to systematically study how these sentiments correlate with reproducibility, we introduce the CC30k dataset, comprising 30,734 citation contexts from machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing. They are supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a gap in resources for computational reproducibility studies. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation, and it achieves a labeling accuracy of 94%. We then demonstrate that the performance of three large language models improves significantly on reproducibility-oriented sentiment classification after fine-tuning on our dataset. The dataset lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The CC30k dataset and the Jupyter notebooks used to produce and analyze it are publicly available at https://github.com/lamps-lab/CC30k.