Trustworthy Medical Question Answering: An Evaluation-Centric Survey

Trustworthiness in healthcare question-answering (QA) systems is essential for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA is difficult because of the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted nature of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA: Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems, compile and compare the major benchmarks designed to assess these dimensions, and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges, such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies, and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
