The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition and lack the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question-answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
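To make the "direct FHIR API call" retrieval strategy concrete, the sketch below shows what such a call looks like against a FHIR R4 REST endpoint and the nested Bundle/resource structure an agent must then reason over. This is an illustrative assumption, not the benchmark's actual harness: the endpoint (the public HAPI FHIR test server), the `search_observations` helper, and the example patient id are all hypothetical choices for demonstration.

```python
# Minimal sketch of a direct FHIR REST API call (illustrative only; not the
# FHIR-AgentBench harness). Assumes a reachable FHIR R4 endpoint and the
# `requests` library.
import requests

FHIR_BASE = "https://hapi.fhir.org/baseR4"  # public demo server, for illustration

def search_observations(patient_id: str, loinc_code: str) -> list[dict]:
    """Search Observation resources for one patient and one LOINC code."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": loinc_code, "_count": 50},
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    bundle = resp.json()  # search results come back as a FHIR Bundle resource
    # Each Bundle.entry wraps one matching resource; unwrap them for the caller.
    return [entry["resource"] for entry in bundle.get("entry", [])]

if __name__ == "__main__":
    # LOINC 8867-4 = heart rate; "example" is a placeholder patient id.
    for obs in search_observations("example", "8867-4"):
        value = obs.get("valueQuantity", {})
        print(obs.get("effectiveDateTime"), value.get("value"), value.get("unit"))
```

Even this simple query illustrates the difficulty the abstract points to: the answer to a clinical question is buried in deeply nested, optional fields (e.g., `valueQuantity.value`), which an agent must locate and interpret before it can reason over the data.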