Recent vision-language foundation models (VLMs) deliver state-of-the-art results on natural image classification but falter on medical images due to pronounced domain shifts. Training a medical foundation model from scratch also requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that flexibly re-purposes arbitrary pre-trained foundation VLMs for medical image diagnosis. MedBridge comprises three novel core components. First, a Focal Sampling module that subsamples and extracts high-resolution local regions to capture subtle pathological features, compensating for the limited input resolution of foundation VLMs. Second, a Query-Encoder that uses a small set of learnable queries to align the feature maps of frozen VLMs with medical semantics, without retraining the backbone layers. Third, a Mixture-of-Experts mechanism, driven by the learnable queries, that harnesses the complementary strengths of diverse VLMs to maximize diagnostic performance. We evaluate MedBridge on five chest radiograph benchmarks across three key adaptation tasks, demonstrating superior performance in both cross-domain and in-domain adaptation settings under varying levels of training data availability. On multi-label thoracic disease diagnosis, MedBridge achieves a 6-15% improvement in AUC over state-of-the-art VLM adaptation methods, underscoring its effectiveness in leveraging diverse foundation models for accurate and data-efficient medical diagnosis. Our project and code are available at https://github.com/ai-med/MedBridge.
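To make the Query-Encoder and the query-driven Mixture-of-Experts concrete, below is a minimal PyTorch sketch of how such an adapter could look. Everything here (class names, a shared feature dimension across experts, mean-pooled gating, cross-attention as the query mechanism) is an illustrative assumption rather than the released MedBridge implementation; see the repository above for the actual code.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Hypothetical sketch: a small set of learnable queries cross-attends to
    feature maps from a frozen VLM encoder, steering them toward medical
    semantics without updating any backbone layer."""
    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch features from one frozen VLM encoder
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # queries attend over VLM tokens
        return self.norm(out)                  # (B, num_queries, dim)

class QueryDrivenMoE(nn.Module):
    """Hypothetical sketch of query-driven expert fusion: pooled query features
    from each frozen VLM feed a gating network whose softmax weights mix the
    per-expert multi-label logits. Assumes all experts are projected to a
    common feature dimension beforehand."""
    def __init__(self, dim: int = 768, num_experts: int = 3, num_labels: int = 14):
        super().__init__()
        self.encoders = nn.ModuleList(QueryEncoder(dim) for _ in range(num_experts))
        self.heads = nn.ModuleList(nn.Linear(dim, num_labels) for _ in range(num_experts))
        self.gate = nn.Linear(dim * num_experts, num_experts)

    def forward(self, expert_tokens: list[torch.Tensor]) -> torch.Tensor:
        # expert_tokens: one (B, N_i, dim) tensor per frozen VLM
        pooled = [enc(t).mean(dim=1) for enc, t in zip(self.encoders, expert_tokens)]
        weights = torch.softmax(self.gate(torch.cat(pooled, dim=-1)), dim=-1)    # (B, E)
        logits = torch.stack([h(p) for h, p in zip(self.heads, pooled)], dim=1)  # (B, E, L)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)  # multi-label logits (B, L)
```

Only the queries, attention layers, heads, and gate are trained, which is what keeps this style of adaptation lightweight relative to fine-tuning or retraining a medical foundation model.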