How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation

Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings such as robotics, where such interaction is expensive. In this work we investigate ways to minimize online interaction on a target task by reusing a suboptimal policy that we might have access to, for example from training on related prior tasks or in simulation. To this end, we develop two RL algorithms that speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand. We conduct a thorough experimental study of how to use suboptimal teachers on a challenging robotic manipulation benchmark: vision-based stacking with diverse objects. We compare our methods to offline, online, offline-to-online, and kickstarting RL algorithms, and find that training on data from both the teacher and the student enables the best performance for limited data budgets. We examine how best to allocate a limited data budget on the target task between the teacher and the student policy, and report experiments using varying budgets, two teachers with different degrees of suboptimality, and five stacking tasks that require a diverse set of behaviors. Our analysis, both in simulation and in the real world, shows that our approach performs best across data budgets, while standard offline RL from teacher rollouts is surprisingly effective when enough data is given.
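
The two ingredients named in the abstract, a teacher's action distribution and a data budget shared between teacher and student, can be illustrated with a small sketch. It is not the paper's algorithms. It assumes a generic kickstarting-style objective (an RL loss plus a weighted cross-entropy between teacher and student policies) over a discrete action set, whereas the stacking benchmark may well use continuous actions. All function names, the `distill_weight` parameter, and the `teacher_fraction` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy H(teacher, student) over a discrete action set.

    Assumption: discrete actions, used here only to keep the sketch small.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def kickstarted_loss(rl_loss, teacher_logits, student_logits, distill_weight):
    """Generic kickstarting-style objective: RL loss plus a weighted
    distillation term that pulls the student toward the teacher.

    distill_weight is a hypothetical hyperparameter; setting it to 0
    recovers the plain RL loss.
    """
    return rl_loss + distill_weight * distillation_loss(
        teacher_logits, student_logits
    )


def split_budget(total_episodes, teacher_fraction):
    """Split a fixed episode budget on the target task between teacher
    rollouts and student rollouts (hypothetical helper)."""
    if not 0.0 <= teacher_fraction <= 1.0:
        raise ValueError("teacher_fraction must lie in [0, 1]")
    teacher_episodes = round(total_episodes * teacher_fraction)
    return teacher_episodes, total_episodes - teacher_episodes
```

Under these assumptions, `split_budget(1000, 0.25)` assigns 250 episodes to teacher rollouts and 750 to the student, and `kickstarted_loss` with a `distill_weight` that shrinks over training would let the student rely on the suboptimal teacher early and on its own experience later.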
