Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

Large language models (LLMs) are now used in settings such as Google's AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into all six possible input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). A one-way ANOVA reveals a significant primacy effect for BERTScore across all stances: summaries are more semantically aligned with the article that appears first in the input. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. These findings point to risks for applications that rely on LLM-generated overviews and for agentic AI systems, where steps involving LLMs can disproportionately influence downstream actions.
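The permutation design and the position-wise test can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the names `input_orders`, `build_prompt`, and `one_way_anova_f` are hypothetical, the prompt wording is invented for illustration, and the ANOVA helper only computes the F statistic over groups of scores (for example, BERTScore values grouped by the position of the source article).

```python
from itertools import permutations
from statistics import mean

STANCES = ("pro", "neutral", "con")


def input_orders(triplet):
    """Return all 3! = 6 orderings of a {stance: article_text} triplet.

    Each ordering is a list of (stance, text) pairs, first-seen article first.
    """
    return [[(s, triplet[s]) for s in order] for order in permutations(STANCES)]


def build_prompt(ordered_articles):
    """Assemble a prompt from ordered articles.

    Hypothetical template: the wording used in the study is not given
    in the abstract, so this text is an assumption.
    """
    parts = [
        f"Article {i}:\n{text}"
        for i, (_, text) in enumerate(ordered_articles, start=1)
    ]
    return "Write a neutral overview of the following articles.\n\n" + "\n\n".join(parts)


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of score groups.

    Assumes at least two groups and non-zero within-group variance.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In this sketch, each stance lands in Position 1 in exactly two of the six orders, so position and stance are balanced by construction. Obtaining a p-value from the F statistic requires the F distribution; a library routine such as `scipy.stats.f_oneway` would be one way to get it.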
