Online, real-time, and fine-grained 3D segmentation is a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advances employ predefined object queries to aggregate semantic information from the outputs of Vision Foundation Models (VFMs) lifted into 3D point clouds, propagating spatial information through inter-query interactions. Perception, however, is an inherently dynamic process, making temporal understanding a critical yet overlooked dimension in these prevailing query-based pipelines. To further unlock the temporal environmental perception capabilities of embodied agents, our work (AutoSeg3D) reconceptualizes online 3D segmentation as an instance tracking problem. Our core strategy uses object queries for temporal information propagation: long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Since viewpoint changes in embodied robotics often leave objects only partially visible in any single frame, this mechanism helps the model build a holistic object understanding that goes beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFM outputs, yielding more comprehensive instance information that strengthens both long-term and short-term temporal learning. Because the temporal information exchange and consistency learning operate on sparse object queries, they enhance spatial comprehension while avoiding the computational burden of dense temporal point cloud interactions. Our method establishes a new state of the art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on the ScanNet, SceneNN, and 3RScan datasets.
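The long-term association and short-term update described above can be sketched generically. This is a minimal illustration, not the paper's actual implementation: the function name, cosine-similarity matching via the Hungarian algorithm, the similarity threshold, and the EMA momentum are all assumptions chosen for clarity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_and_update(memory, current, sim_thresh=0.5, momentum=0.9):
    """Illustrative sketch (not the paper's method).

    Long-term association: match current-frame object queries to a
    memory bank of instance queries by cosine similarity.
    Short-term update: refresh matched memory entries with an
    exponential moving average so instant observations enrich them.

    memory:  (M, D) per-instance query features (float array)
    current: (N, D) current-frame query features (float array)
    Returns (updated memory, list of (memory_idx, current_idx) matches).
    """
    mem_n = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    cur_n = current / np.linalg.norm(current, axis=1, keepdims=True)
    sim = mem_n @ cur_n.T                      # (M, N) cosine similarities
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity

    matches, assigned = [], set()
    for r, c in zip(rows, cols):
        if sim[r, c] >= sim_thresh:
            # short-term update: blend the instant observation into memory
            memory[r] = momentum * memory[r] + (1 - momentum) * current[c]
            matches.append((r, c))
            assigned.add(c)

    # unmatched current queries spawn new instance entries
    new_ids = [c for c in range(current.shape[0]) if c not in assigned]
    if new_ids:
        memory = np.vstack([memory, current[new_ids]])
    return memory, matches
```

Because the matching runs over a handful of sparse query vectors rather than dense per-point features, its cost is negligible compared with dense temporal point cloud interactions, which is the efficiency argument made above.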