While Retrieval-Augmented Generation (RAG) demonstrates remarkable capabilities in large language model (LLM) applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core of AttentionRAG is an attention focus mechanism that reformulates each RAG query into a next-token prediction paradigm: the query's semantic focus is isolated to a single token, enabling precise and efficient attention computation between the query and the retrieved context. Extensive experiments on the LongBench and BABILong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua-style methods by around 10\% on key metrics.
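To make the attention-guided pruning idea concrete, the following is a minimal sketch, not the paper's exact algorithm: the model name, the prompt format, the use of the final query position as a stand-in for the paper's single focus token, and the sentence-level score aggregation are all illustrative assumptions.

```python
# Minimal sketch of attention-guided context pruning (illustrative only).
# Assumptions: model name, prompt format, last-position-as-focus-token,
# and mean-attention sentence scoring are not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumed; any causal LM exposing attentions works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation="eager", output_attentions=True
)
model.eval()

def prune_context(query: str, sentences: list[str], keep: int = 3) -> list[str]:
    """Keep the `keep` sentences whose tokens receive the most attention
    from the final prompt position (the next-token-prediction slot)."""
    context = " ".join(sentences)
    prompt = f"{context}\nQuestion: {query}\nAnswer:"  # assumed prompt format
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Average attention over layers and heads, read from the last position
    # toward every earlier token; shape becomes [seq_len].
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]
    # Map token-level scores back to sentences (approximate alignment:
    # per-sentence tokenization may not exactly tile the joined prompt).
    scores, offset = [], 0
    for s in sentences:
        n = len(tok(s, add_special_tokens=False)["input_ids"])
        scores.append(attn[offset:offset + n].mean().item())
        offset += n
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

A caller would pass the retrieved passage split into sentences, e.g. `prune_context(question, retrieved_sentences, keep=3)`, and feed only the returned sentences to the generator, which is where the context compression reported above would come from.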