CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis

Sentiments about the reproducibility of cited papers, as expressed in downstream literature, offer community perspectives and have shown promise as a signal of the actual reproducibility of published findings. To train models that predict reproducibility-oriented sentiments and to systematically study how these sentiments correlate with reproducibility, we introduce the CC30k dataset, comprising 30,734 citation contexts from machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a gap in resources for computational reproducibility studies. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation, and it achieves a labeling accuracy of 94%. We then demonstrate that fine-tuning on our dataset significantly improves the performance of three large language models on reproducibility-oriented sentiment classification. The dataset lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The CC30k dataset and the Jupyter notebooks used to produce and analyze it are publicly available at https://github.com/lamps-lab/CC30k .
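
As a minimal sketch of the labeling scheme described above, the snippet below models a citation context with one of the three labels and tallies a label distribution. The `CitationContext` record, its field names, and the toy sentences are hypothetical illustrations, not the schema or contents of the released dataset. The only figures taken from the abstract are the two counts; the remainder of 4,905 contexts is simply their difference and is not stated directly in the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# The three reproducibility-oriented sentiment labels named in the abstract.
LABELS = ("Positive", "Negative", "Neutral")

# Counts reported in the abstract.
TOTAL_CONTEXTS = 30_734
CROWDSOURCED_CONTEXTS = 25_829


@dataclass(frozen=True)
class CitationContext:
    """Hypothetical record layout; the real dataset's columns may differ."""
    text: str
    label: str

    def __post_init__(self):
        # Reject anything outside the three-label scheme.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def label_distribution(contexts):
    """Count how many contexts carry each label, including zero counts."""
    counts = Counter(c.label for c in contexts)
    return {label: counts.get(label, 0) for label in LABELS}


# Contexts not covered by crowdsourcing (derived by subtraction only).
remaining_contexts = TOTAL_CONTEXTS - CROWDSOURCED_CONTEXTS

# Toy, made-up examples purely for illustration.
toy_contexts = [
    CitationContext("We reproduced the results of [1] using their code.", "Positive"),
    CitationContext("We were unable to replicate the numbers reported in [2].", "Negative"),
    CitationContext("Our model builds on the architecture of [3].", "Neutral"),
    CitationContext("Following [4], we use the same tokenizer.", "Neutral"),
]
toy_distribution = label_distribution(toy_contexts)

print(remaining_contexts)
print(toy_distribution)
```

Validating labels at construction time keeps malformed rows from silently skewing the distribution, which matters when one class (here, Negative) is scarce.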

