Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks

Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated ``ground truth'' using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we treat them as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$), whereas equivalence is not established in the latter ($p = 0.155$), highlighting task-dependent differences. The method also enables early evaluation on a small sample of human data to indicate whether LLMs are suitable for large-scale annotation in a given application.
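The abstract names the three ingredients but not how they are combined, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes complete numeric ratings, the interval metric for Krippendorff's $\alpha$, a bootstrap over items in which one randomly chosen human is replaced by the LLM, and a normal approximation with a bootstrap standard error in place of the t-tests. The function names and the equivalence margin are hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of items; each item is a list of numeric ratings.
    """
    units = [u for u in units if len(u) >= 2]  # only pairable items count
    values = [v for u in units for v in u]
    n = len(values)

    def sum_sq_diffs(xs):
        # Sum of (a - b)^2 over all ordered pairs, in closed form.
        return 2 * len(xs) * sum(x * x for x in xs) - 2 * sum(xs) ** 2

    d_obs = sum(sum_sq_diffs(u) / (len(u) - 1) for u in units) / n
    d_exp = sum_sq_diffs(values) / (n * (n - 1))
    if d_exp == 0:
        return 1.0  # assumption: no variation at all is treated as agreement
    return 1.0 - d_obs / d_exp


def paired_bootstrap_alpha_diff(human, llm, n_boot=1000, seed=0):
    """Bootstrap alpha(group with LLM swapped in) - alpha(humans only).

    `human[i]` holds the human ratings for item i; `llm[i]` is the LLM rating.
    The same resampled items feed both alphas, which makes the draw paired.
    """
    rng = random.Random(seed)
    n_items, n_raters = len(human), len(human[0])
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        swap = rng.randrange(n_raters)  # which human the LLM replaces
        humans_only = [human[i] for i in idx]
        with_llm = [human[i][:swap] + [llm[i]] + human[i][swap + 1:] for i in idx]
        diffs.append(krippendorff_alpha_interval(with_llm)
                     - krippendorff_alpha_interval(humans_only))
    return diffs


def tost_p_value(diffs, margin):
    """Two one-sided tests for |difference| < margin (normal approximation)."""
    est = statistics.fmean(diffs)
    se = statistics.stdev(diffs)  # bootstrap estimate of the standard error
    if se == 0:
        return 0.0 if abs(est) < margin else 1.0
    z = NormalDist()
    p_lower = 1.0 - z.cdf((est + margin) / se)  # H0: difference <= -margin
    p_upper = z.cdf((est - margin) / se)        # H0: difference >= +margin
    return max(p_lower, p_upper)                # small p supports equivalence
```

Under this reading, a small p-value rejects both one-sided null hypotheses and supports equivalence within the chosen margin, whereas a large p-value means only that equivalence could not be shown. A call such as `tost_p_value(paired_bootstrap_alpha_diff(human, llm), margin=0.05)` would produce the test result; the margin of 0.05 is an illustrative value and would need to be justified for each task.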
