Learning Visual-Audio Representations for Voice-Controlled Robots

Inspired by sensorimotor theory, we propose a novel pipeline for task-oriented voice-controlled robots. Previous methods rely on large amounts of labeled data and on task-specific reward functions. Such approaches are hard to improve after deployment and generalize poorly across robotic platforms and tasks. To address these problems, we learn a visual-audio representation (VAR) that associates images with sound commands using minimal supervision. Using this representation, we generate an intrinsic reward function for learning robot policies with reinforcement learning, which eliminates laborious reward engineering. We demonstrate our approach on various robotic platforms, where the robots hear an audio command, identify the associated target object, and perform precise control to fulfill the command. We show that our method outperforms previous work across various sound types and robotic tasks, even with fewer labels. We successfully deploy a policy learned in a simulator to a real Kinova Gen3. We also demonstrate that our VAR and the intrinsic reward function allow the robot to improve itself using only a small amount of labeled data collected in the real world.
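The abstract does not say how the intrinsic reward is computed from the VAR. As a minimal sketch only, suppose the representation maps the current camera image and the sound command into a shared embedding space, and the reward is the cosine similarity between the two embeddings. Both the shared-space setup and the choice of cosine similarity are assumptions made for illustration, and the names `cosine_similarity` and `intrinsic_reward` are hypothetical rather than taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def intrinsic_reward(image_embedding: Sequence[float],
                     audio_embedding: Sequence[float]) -> float:
    """Hypothetical reward: how well the current image matches the command.

    Assumption (not from the paper): both embeddings come from the same
    learned representation and live in one shared vector space.
    """
    return cosine_similarity(image_embedding, audio_embedding)
```

Under these assumptions, a reinforcement learning agent would receive a higher reward as its observation moves closer to the scene associated with the spoken command, with no hand-written, task-specific reward term.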
