Hierarchical Reinforcement Learning under Mixed Observability

Source: arXiv. Accepted at the 15th International Workshop on the Algorithmic Foundations of Robotics (WAFR) 2022, University of Maryland, College Park. The first two authors contributed equally.

The framework of mixed observable Markov decision processes (MOMDP) models many robotic domains in which some state variables are fully observable while others are not. In this work, we identify a significant subclass of MOMDPs defined by how actions influence the fully observable components of the state and how those, in turn, influence the partially observable components and the rewards. This unique property allows for a two-level hierarchical approach we call HIerarchical Reinforcement Learning under Mixed Observability (HILMO), which restricts partial observability to the top level while the bottom level remains fully observable, enabling higher learning efficiency. The top level produces desired goals to be reached by the bottom level until the task is solved. We further develop theoretical guarantees to show that our approach can achieve optimal and quasi-optimal behavior under mild assumptions. Empirical results on long-horizon continuous control tasks demonstrate the efficacy and efficiency of our approach in terms of improved success rate, sample efficiency, and wall-clock training time. We also deploy policies learned in simulation on a real robot.
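To make the two-level split concrete, here is a minimal Python sketch on a toy problem. It is not the authors' implementation or their benchmark: the `Corridor` environment, the hand-coded `bottom_policy` and `top_policy`, and the belief update are all hypothetical stand-ins chosen for illustration. The robot's position is fully observable, the cell holding the reward is hidden, and what the robot observes depends only on where it stands, which mirrors the structure the abstract describes. The top level reasons over a belief about the hidden state and emits subgoals; the bottom level only ever sees the observable position and the current subgoal.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Corridor:
    """Toy MOMDP (hypothetical): `x` is fully observable, `reward_cell` is hidden."""
    length: int
    reward_cell: int  # hidden state, either 0 or length - 1
    x: int            # fully observable state

    def step(self, action: int) -> int:
        """Primitive actions (-1 or +1) change only the observable position."""
        self.x = max(0, min(self.length - 1, self.x + action))
        return self.x

    def observe(self) -> bool:
        """Observation depends on the observable state: True iff standing on the reward."""
        return self.x == self.reward_cell


def bottom_policy(x: int, goal: int) -> int:
    """Fully observable goal-reaching controller (hand-coded stand-in for a learned one)."""
    return 1 if goal > x else -1


def top_policy(belief: dict[int, float]) -> int:
    """Pick the candidate cell with the highest belief as the next subgoal."""
    return max(belief, key=belief.get)


def run_episode(env: Corridor, max_subgoals: int = 4) -> tuple[bool, list[int]]:
    ends = (0, env.length - 1)
    belief = {cell: 0.5 for cell in ends}  # uniform prior over the hidden state
    subgoals: list[int] = []
    for _ in range(max_subgoals):
        goal = top_policy(belief)          # top level: acts on the belief only
        subgoals.append(goal)
        while env.x != goal:               # bottom level: runs until the subgoal is reached
            env.step(bottom_policy(env.x, goal))
        if env.observe():
            return True, subgoals
        # Negative observation: the reward must be at the other end.
        other = ends[1] if goal == ends[0] else ends[0]
        belief[goal], belief[other] = 0.0, 1.0
    return False, subgoals
```

For example, `run_episode(Corridor(length=5, reward_cell=4, x=2))` returns `(True, [0, 4])`: the top level first sends the robot to cell 0, sees no reward there, updates its belief, and then sends it to cell 4. In the setting described in the abstract both levels would be learned rather than hand-coded; the sketch only shows where partial observability lives in the hierarchy.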
