The LUMIR challenge is an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions. These assertions contradict the established understanding of domain shift in deep learning. In this paper, we independently re-evaluate these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on a human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d effect sizes ranging from 0.7 to 1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep learning methods are highly sensitive to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
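For readers unfamiliar with the effect size quoted in finding (2), the following is a minimal sketch of how Cohen's d can be computed between two sets of per-subject scores. It assumes independent samples and the pooled-standard-deviation form of the statistic; the abstract does not state which variant the authors used, and the function name `cohens_d` and the input values are purely illustrative, not data from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: this is the classic pooled-SD form; a paired design would
    instead divide the mean difference by the SD of the paired differences.
    """
    na, nb = len(a), len(b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-subject scores for two methods (illustrative values only).
method_a = [0.80, 0.82, 0.84]
method_b = [0.70, 0.72, 0.74]
print(cohens_d(method_a, method_b))
```

By the usual convention, |d| around 0.8 or above is read as a large effect, which is why a range of 0.7 to 1.5 is described as substantial.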