Temporal action segmentation (TAS) has long been a key research area in both robotics and computer vision. In robotics, algorithms have primarily relied on proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics also incorporating vision. Computer vision, in contrast, typically relies on exteroceptive sensors such as cameras. Existing multimodal TAS models in robotics perform feature fusion inside the model itself, making the learned features difficult to reuse across different models, while the pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges with M2R2, a multimodal feature extractor tailored for TAS that combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the learned features to be reused across multiple TAS models. Our method achieves state-of-the-art performance on REASSEMBLE, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of each modality to robotic TAS tasks.