Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and to provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along the three facets, and refines them to better meet each facet's requirements. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains on the context-awareness axis. Furthermore, by combining the proposed method with knowledge distillation, a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new state of the art among all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
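The pipeline described above (attribute-conditioned query generation, self-evaluation along three facets, refinement, and collection of supervised fine-tuning data) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, and the attribute pools and prompt templates are assumptions, not the actual ones used by MuSeR.

```python
import itertools
import random

# Illustrative attribute pools; the paper varies role, region, intent,
# and information ambiguity (these specific values are assumptions).
ROLES = ["patient", "caregiver", "nurse"]
REGIONS = ["US", "India", "Brazil"]
INTENTS = ["triage", "medication question", "lifestyle advice"]
AMBIGUITY = ["fully specified", "missing medical history", "vague symptoms"]

FACETS = ["decision-making", "communication", "safety"]

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion API."""
    return f"<response to: {prompt[:40]}...>"

def generate_query(role: str, region: str, intent: str, ambiguity: str) -> str:
    # Attribute-conditioned query generation.
    return call_llm(
        f"Write a medical query from a {role} in {region} "
        f"with intent '{intent}' whose context is {ambiguity}."
    )

def refine(query: str) -> str:
    # Answer, self-evaluate along each facet, then refine accordingly.
    answer = call_llm(query)
    for facet in FACETS:
        critique = call_llm(f"Evaluate this answer on {facet}:\n{answer}")
        answer = call_llm(
            f"Refine the answer using this critique:\n{critique}\n{answer}"
        )
    return answer

def build_sft_dataset(n: int, seed: int = 0) -> list[dict]:
    # Sample attribute combinations, generate queries, and pair each
    # query with its multifaceted-refined response for fine-tuning.
    random.seed(seed)
    combos = list(itertools.product(ROLES, REGIONS, INTENTS, AMBIGUITY))
    samples = random.sample(combos, n)
    return [
        {"query": (q := generate_query(*attrs)), "response": refine(q)}
        for attrs in samples
    ]

dataset = build_sft_dataset(3)
print(len(dataset), sorted(dataset[0].keys()))
```

In practice the `(query, response)` pairs would be serialized into the chat format expected by the fine-tuning framework; the stubbed `call_llm` keeps the sketch self-contained and deterministic.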