MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question answering and overlook multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, and their diagnostic and prognostic tasks require integrating heterogeneous data and insights that evolve over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark that simulates MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground-truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open- and closed-source LLMs and show that, even at scale, they lack reliability: they frequently hallucinate, struggle to reason over time-resolved data, and fail to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation-model-based tools that enhance multimodal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool use, with a focus on MTB environments in precision oncology.
