Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, reinforcement learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, which encourages cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture in which a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations on multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP, which achieves high success rates on a real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
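To make the two components described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a masked-autoencoding encoder that reconstructs all sensor embeddings from a randomly kept subset, and an asymmetric actor-critic in which the critic cross-attends over the frozen per-sensor embeddings while the actor only sees a pooled summary. All names, dimensions, and hyperparameters (e.g. `MaskedMultisensoryEncoder`, `mask_ratio`, `n_queries`) are illustrative assumptions not taken from the paper.

```python
# Hedged sketch of the MSDP ideas described in the abstract (assumptions, not the paper's code).
import torch
import torch.nn as nn


class MaskedMultisensoryEncoder(nn.Module):
    """Transformer encoder trained by reconstructing every sensor embedding
    from a randomly kept subset (masked autoencoding across modalities)."""

    def __init__(self, dim=128, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.recon_head = nn.Linear(dim, dim)

    def forward(self, sensor_tokens):                 # (B, n_sensors, dim)
        B, S, D = sensor_tokens.shape
        keep = torch.rand(B, S, device=sensor_tokens.device) > self.mask_ratio
        # Replace the dropped sensor embeddings with a learned mask token.
        x = torch.where(keep.unsqueeze(-1), sensor_tokens,
                        self.mask_token.expand(B, S, D))
        z = self.encoder(x)
        recon = self.recon_head(self.decoder(z))
        # Cross-modal prediction: reconstruct all modalities, masked or not.
        loss = ((recon - sensor_tokens) ** 2).mean()
        return z, loss


class AsymmetricActorCritic(nn.Module):
    """Critic extracts task-specific features via cross-attention over the
    frozen embeddings; the actor receives a stable pooled representation."""

    def __init__(self, dim=128, act_dim=7, n_queries=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.q_head = nn.Sequential(nn.Linear(n_queries * dim + act_dim, 256),
                                    nn.ReLU(), nn.Linear(256, 1))
        self.actor = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                   nn.Linear(256, act_dim), nn.Tanh())

    def critic(self, z, action):                      # z: (B, n_sensors, dim)
        q = self.queries.expand(z.size(0), -1, -1)
        feats, _ = self.cross_attn(q, z, z)           # task-specific readout
        return self.q_head(torch.cat([feats.flatten(1), action], dim=-1))

    def act(self, z):
        return self.actor(z.mean(dim=1))              # pooled summary only
```

In this sketch the encoder would be pretrained with the reconstruction loss, then frozen for downstream RL, with gradients from the critic's cross-attention readout never reaching the encoder; the exact masking scheme, decoder design, and RL algorithm are left unspecified by the abstract.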