Large vision-language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet the extent to which they rely on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 Flash Exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute the correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding, with a 27.9 percentage-point accuracy drop (from 83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5, 2.4, and 5.6 percentage points, respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
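The abstract does not specify how the blank-placeholder substitution or the bracketed accuracy intervals were produced. The following is a minimal sketch of one way to implement both steps, assuming Pillow for generating a uniform white placeholder image and a percentile bootstrap for the confidence intervals; the per-question outcome lists and the 512x512 placeholder size are illustrative assumptions, not values taken from the paper.

```python
import io
import random

from PIL import Image  # Pillow; assumed available for image generation


def blank_placeholder(width: int = 512, height: int = 512) -> bytes:
    """Create a uniform white PNG to stand in for the original medical image."""
    img = Image.new("RGB", (width, height), color="white")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def bootstrap_accuracy_ci(correct: list, n_boot: int = 10_000, seed: int = 0):
    """Point accuracy and percentile-bootstrap 95% CI over per-question 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    boots = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return point, boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]


# Hypothetical per-question outcomes (1 = correct answer) for the two conditions.
with_image = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
with_blank = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

acc_img, lo_img, hi_img = bootstrap_accuracy_ci(with_image)
acc_blk, lo_blk, hi_blk = bootstrap_accuracy_ci(with_blank)
print(f"with image: {acc_img:.1%} [{lo_img:.1%}, {hi_img:.1%}]")
print(f"with blank: {acc_blk:.1%} [{lo_blk:.1%}, {hi_blk:.1%}]")
print(f"accuracy drop: {(acc_img - acc_blk) * 100:.1f} pp")
```

In such a setup, the placeholder bytes would be sent to each model's vision API in place of the real image while the Italian question text is left unchanged, so any remaining accuracy reflects textual shortcuts rather than image interpretation.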