Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to those of skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that, overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a rate similar to that of skilled crowdworkers, while producing annotations at a fraction of the cost per annotation. We release our dataset of over 40k model and human span annotations for further research.