Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos, yet existing AI systems fail to unify all of these signals, limiting their utility. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) designed to unify language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, consisting exclusively of public or synthetic data, spanning 12 major anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks, covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis, Hulu-Med surpasses existing open-source models on 27 of the 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 of them. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible, and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.
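The abstract mentions a medical-aware token-reduction strategy but does not detail its criterion. The sketch below illustrates one common family of approaches it could plausibly belong to: similarity-based pruning of temporally redundant patch tokens across video frames or 3D slices. The function name, the cosine-similarity rule, and the threshold are illustrative assumptions, not Hulu-Med's published method.

```python
# A minimal sketch of similarity-based visual-token pruning for 3D/video
# inputs. Hulu-Med's actual "medical-aware" criterion is not specified in
# the abstract; the cosine-similarity rule, the function name, and the
# threshold below are illustrative assumptions only.
import torch
import torch.nn.functional as F

def prune_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.9):
    """Drop patch tokens that are near-duplicates of the token at the same
    spatial position in the previous frame/slice.

    tokens: (num_frames, num_patches, dim) visual tokens from the encoder.
    Returns a list of per-frame tensors of kept tokens (varying lengths).
    """
    kept = [tokens[0]]  # always keep the first frame/slice in full
    for t in range(1, tokens.shape[0]):
        # Cosine similarity between each patch token and its predecessor
        # at the same spatial position in the previous frame/slice.
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)
        mask = sim < threshold  # keep only tokens that changed enough
        kept.append(tokens[t][mask])
    return kept

# Example: a 16-frame clip of 256 patch tokens (width 1024) with slowly
# varying content, so most tokens exceed the similarity threshold.
clip = torch.empty(16, 256, 1024)
clip[0] = torch.randn(256, 1024)
for t in range(1, 16):
    clip[t] = 0.9 * clip[t - 1] + 0.1 * torch.randn(256, 1024)

pruned = prune_redundant_tokens(clip)
total = sum(p.shape[0] for p in pruned)
print(f"kept {total} of {16 * 256} tokens ({1 - total / (16 * 256):.0%} pruned)")
```

On highly redundant inputs such as adjacent CT slices or consecutive endoscopy frames, a rule of this kind can remove a large fraction of visual tokens before the language model sees them, which is consistent with the order of savings (up to 55%) the abstract reports for 3D and video inputs.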