Medical Referring Image Segmentation (MRIS) involves segmenting target regions in medical images based on natural language descriptions. Although recent approaches achieve promising results, they typically rely on complex multimodal fusion modules or multi-stage decoders. In this work, we propose NTP-MRISeg, a novel framework that reformulates MRIS as an autoregressive next-token prediction task over a unified multimodal sequence of tokenized image, text, and mask representations. This formulation streamlines model design by eliminating the need for modality-specific fusion and external segmentation models, and it supports a unified architecture that is trained end to end. It also enables the use of pretrained tokenizers from emerging large-scale multimodal models, enhancing generalization and adaptability. More importantly, to address the challenges that arise under this formulation, such as exposure bias, long-tail token distributions, and fine-grained lesion edges, we propose three novel strategies: (1) a Next-k Token Prediction (NkTP) scheme to reduce cumulative prediction errors, (2) Token-level Contrastive Learning (TCL) to enhance boundary sensitivity and mitigate long-tail distribution effects, and (3) a memory-based Hard Error Token (HET) optimization strategy that emphasizes difficult tokens during training. Extensive experiments on the QaTa-COV19 and MosMedData+ datasets demonstrate that NTP-MRISeg achieves new state-of-the-art performance, offering a streamlined and effective alternative to traditional MRIS pipelines.
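To make the sequence formulation concrete, the following is a minimal sketch of how a unified token sequence and next-k prediction targets could be assembled. It is an illustration under stated assumptions, not the NTP-MRISeg implementation: the vocabulary sizes, the image-text-mask ordering, the disjoint id ranges, and the helper names `build_sequence` and `next_k_targets` are all hypothetical, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary sizes; the abstract does not name the tokenizers used.
TEXT_VOCAB = 1000
IMAGE_VOCAB = 512


def build_sequence(
    image_ids: Sequence[int],
    text_ids: Sequence[int],
    mask_ids: Sequence[int],
) -> Tuple[List[int], int]:
    """Concatenate image, text, and mask tokens into one sequence.

    Each modality is shifted into its own id range so a single vocabulary
    can cover all three (an assumption made for this sketch). Returns the
    sequence and the index at which the mask tokens begin.
    """
    text = list(text_ids)
    image = [TEXT_VOCAB + t for t in image_ids]
    mask = [TEXT_VOCAB + IMAGE_VOCAB + t for t in mask_ids]
    seq = image + text + mask
    return seq, len(image) + len(text)


def next_k_targets(
    seq: Sequence[int], mask_start: int, k: int
) -> List[Tuple[int, List[int]]]:
    """For every position whose next token is a mask token, collect the
    next k tokens as prediction targets (fewer near the end of the sequence).

    With k = 1 this reduces to ordinary next-token prediction.
    """
    targets = []
    for t in range(mask_start - 1, len(seq) - 1):
        targets.append((t, list(seq[t + 1 : t + 1 + k])))
    return targets


if __name__ == "__main__":
    seq, mask_start = build_sequence([1, 2, 3], [5, 6], [7, 8, 9])
    for position, future in next_k_targets(seq, mask_start, k=2):
        print(position, future)
```

In a training loop, each `(position, future)` pair would supply up to k supervised targets for the model's output at `position`; how those k predictions are produced and weighted is a design choice the abstract leaves open.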