While Large Language Models (LLMs) excel at reasoning in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large-scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution; its effectiveness depends heavily on the specific strategy and context. In particular, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM's internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative overthinking.
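For concreteness, the sketch below illustrates the distinction between direct prompting and an explicit reasoning strategy for summarization. It is a minimal illustration only: the OpenAI-compatible client, model name, and prompt wording are assumptions for exposition, not the prompts or setup used in this study.

```python
# Minimal sketch (illustrative only): direct prompting vs. an explicit reasoning
# strategy for abstractive summarization. Client, model name, and prompt wording
# are assumptions, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def summarize_direct(document: str, model: str = "gpt-4o-mini") -> str:
    """Direct prompting: request the summary with no intermediate reasoning."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Summarize the following document in 3 sentences:\n\n{document}",
        }],
    )
    return resp.choices[0].message.content


def summarize_with_explicit_reasoning(document: str, model: str = "gpt-4o-mini") -> str:
    """Explicit reasoning strategy: elicit intermediate analysis before the summary."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "First list the key facts and entities in the document, "
                "then write a 3-sentence summary grounded only in those facts.\n\n"
                f"Document:\n{document}"
            ),
        }],
    )
    return resp.choices[0].message.content
```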