Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.