Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that exploits the intrinsic robustness of DINO self-attention "key" features for segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach feeds the key features of the self-attention module to a simple convolutional decoder to predict polyp masks, yielding improved performance and better generalizability. We validate our approach on a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, show that this pipeline achieves state-of-the-art (SOTA) performance, with particularly large gains in data-scarce and challenging scenarios. Despite avoiding a polyp-specific architecture, we surpass well-established models such as nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the impact of its architectural advancements on downstream polyp segmentation performance.
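To make the pipeline concrete, the sketch below shows one way to extract self-attention "key" features from a pretrained DINO ViT and decode them into a polyp mask with a small convolutional head. It is a minimal illustration, not the authors' implementation: the backbone (`dino_vits16` from `torch.hub`), the choice of the last transformer block, and the `ConvDecoder` layout are all assumptions made for demonstration.

```python
# Minimal sketch (not the paper's exact code): extract DINO self-attention
# "key" features via a forward hook and decode them with a simple conv head.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pretrained DINO ViT-S/16 backbone (embed dim 384), loaded from torch.hub.
vit = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vit.eval()

keys = {}

def qkv_hook(module, inputs, output):
    # DINO's Attention module computes q, k, v with one Linear layer, so the
    # output has shape (B, N, 3 * dim) ordered [q | k | v]; keep the key third.
    dim = output.shape[-1] // 3
    keys['k'] = output[..., dim:2 * dim]  # (B, N, dim)

# Hook the qkv projection of the last block (assumption: the paper may use a
# different block or aggregate keys from several blocks).
vit.blocks[-1].attn.qkv.register_forward_hook(qkv_hook)

class ConvDecoder(nn.Module):
    """Simple convolutional decoder mapping ViT key tokens to mask logits."""
    def __init__(self, dim=384):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(dim, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),  # single-channel polyp logit map
        )

    def forward(self, tokens, grid):
        # Drop the [CLS] token and reshape patch tokens into a 2-D feature map.
        x = tokens[:, 1:, :].transpose(1, 2)          # (B, dim, N - 1)
        x = x.reshape(x.shape[0], x.shape[1], *grid)  # (B, dim, H/16, W/16)
        return self.head(x)                           # low-resolution logits

decoder = ConvDecoder(dim=384)

img = torch.randn(1, 3, 224, 224)  # dummy colonoscopy frame
with torch.no_grad():
    _ = vit(img)                                # forward pass fills keys['k']
    logits = decoder(keys['k'], grid=(14, 14))  # 224 / 16 = 14 patches per side
    mask = torch.sigmoid(
        F.interpolate(logits, size=(224, 224), mode='bilinear',
                      align_corners=False)
    )
print(mask.shape)  # torch.Size([1, 1, 224, 224])
```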