VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation

In 3D manipulation tasks, robots must plan actions from the observations of multiple fixed cameras. This multi-camera setup introduces substantial redundant and irrelevant information, which increases computational cost and forces the model to spend extra training time extracting the task-relevant details. To filter out this redundancy and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), a method that leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from the constructed 3D point cloud; this view efficiently captures the necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on both the RLBench simulation benchmark and real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89x speedup in training and a 1.54x speedup in inference. More results can be found on our project website at https://verm-ral.github.io.
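To make the idea of a virtual view rendered from a point cloud concrete, the following is a minimal sketch, not the VERM implementation. It assumes a pinhole camera with a hand-picked pose, whereas the abstract describes the pose as being chosen with the help of a foundation model; that selection step is not shown. The function names `look_at_rotation` and `render_virtual_depth`, the focal length, and the image size are all hypothetical choices made for illustration.

```python
import numpy as np

def look_at_rotation(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation (x right, y down, z forward).

    Assumes the viewing direction is not parallel to `up`.
    """
    forward = np.asarray(target, float) - np.asarray(cam_pos, float)
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, float))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    # Rows are the camera axes expressed in world coordinates.
    return np.stack([right, down, forward])

def render_virtual_depth(points, cam_pos, target, focal=100.0, size=64):
    """Project an (N, 3) point cloud into a size x size depth image.

    Hypothetical helper: a plain z-buffer keeps the nearest point per
    pixel; pixels that receive no point stay at +inf.
    """
    R = look_at_rotation(cam_pos, target)
    # Express points in the camera frame.
    pc = (np.asarray(points, float) - np.asarray(cam_pos, float)) @ R.T
    pc = pc[pc[:, 2] > 1e-6]  # drop points behind the camera
    u = np.round(focal * pc[:, 0] / pc[:, 2] + size / 2).astype(int)
    v = np.round(focal * pc[:, 1] / pc[:, 2] + size / 2).astype(int)
    inside = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    depth = np.full((size, size), np.inf)
    # Unbuffered minimum: the closest point wins when several share a pixel.
    np.minimum.at(depth, (v[inside], u[inside]), pc[inside, 2])
    return depth
```

In this sketch a single rendered image replaces the several fixed-camera images, which is the source of the redundancy the abstract refers to. A well-placed virtual camera can also look past an occluder that blocks one of the physical cameras, provided the merged point cloud contains the hidden surface.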
