The recent Segment Anything Model (SAM) 3 advances significantly over its predecessor, SAM 2, particularly through the integration of language-based segmentation and enhanced 3D perception. SAM 3 supports zero-shot segmentation from a range of prompt types, including point, bounding box, and language prompts, allowing more flexible and intuitive interaction with the model. In this empirical evaluation, we assess SAM 3 in robot-assisted surgery: we benchmark its zero-shot segmentation with point and bounding box prompts, examine its effectiveness in dynamic video tracking, and evaluate its newly introduced language-prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. We also investigate SAM 3's 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. In comprehensive tests on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts. Zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, but also reveal remaining limitations in complex, highly dynamic surgical scenes.