LiveRAG: A Diverse Q&A Dataset with Varying Difficulty Levels for RAG Evaluation

With Retrieval-Augmented Generation (RAG) becoming increasingly prominent in generative AI solutions, there is an emerging need to evaluate RAG-based systems systematically. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers and their associated supporting claims, which were used to evaluate competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived by applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. We hope the LiveRAG benchmark will help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
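To make the difficulty and discriminability scores more concrete, the following is a minimal sketch of how such parameters behave under a two-parameter logistic (2PL) Item Response Theory model. The text above does not state which IRT variant was used, so the 2PL form is an assumption here, and the function name `p_correct` and all numeric values are hypothetical illustrations rather than values from the benchmark.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """Probability of a correct answer under an assumed 2PL IRT model.

    ability:        latent skill of the system being evaluated
    difficulty:     item location; higher means a harder question
    discrimination: item slope; higher means the question separates
                    weak and strong systems more sharply
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # Hypothetical systems and questions, for illustration only.
    weak_system, strong_system = -1.0, 1.0

    for label, difficulty, discrimination in [
        ("easy, low discrimination", -2.0, 0.5),
        ("medium, high discrimination", 0.0, 2.0),
        ("hard, low discrimination", 2.0, 0.5),
    ]:
        p_weak = p_correct(weak_system, difficulty, discrimination)
        p_strong = p_correct(strong_system, difficulty, discrimination)
        print(f"{label}: weak={p_weak:.3f} strong={p_strong:.3f} "
              f"gap={p_strong - p_weak:.3f}")
```

Under this assumed model, a question whose difficulty lies between the abilities of two systems and whose discrimination is high yields the largest gap in success probability, which is the sense in which such a question is useful for telling systems apart.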
