Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This, in turn, has motivated the development of metrics intended to measure the factual consistency of generated summaries against their sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate ``easy'' examples, where surface features suffice for factual evaluation, from ``hard'' cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed, i.e., their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the prompt-based ChatGPT-DA approach is the most robust and reliable. However, this comes with a notable caveat: prompting LLMs to assess factuality may overly rely on their parametric knowledge rather than the provided reference when making judgments. Taken together, our findings call into question the reliability of current factuality metrics and prompt a broader reflection on what these metrics are truly measuring.