Multi-goal reinforcement learning is widely applied in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle these two challenges via goal relabeling. However, HER-related works still require millions of samples and huge computational cost. In this paper, we propose Multi-step Hindsight Experience Replay (MHER), which incorporates multi-step relabeled returns based on $n$-step relabeling to improve sample efficiency. Despite the advantages of $n$-step relabeling, we show theoretically and experimentally that the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address this issue, two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER), are presented. MHER($\lambda$) exploits the $\lambda$-return, while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions can successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER, with little additional computation beyond HER.
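As a minimal illustration of the quantities the abstract refers to (using standard notation that is not fixed by the abstract itself: states $s_t$, actions $a_t$, relabeled goal $g'$, discount $\gamma$, critic $Q$, policy $\pi$, and truncation horizon $N$; the exact formulation in the paper may differ), the $n$-step return computed along a stored trajectory after goal relabeling, and its $\lambda$-weighted combination, can be written as
\begin{align}
  R_t^{(n)} &= \sum_{i=0}^{n-1} \gamma^{i}\, r\big(s_{t+i}, a_{t+i}, g'\big)
              + \gamma^{n}\, Q\big(s_{t+n}, \pi(s_{t+n}, g'), g'\big), \\
  R_t^{\lambda} &= (1-\lambda) \sum_{n=1}^{N-1} \lambda^{\,n-1} R_t^{(n)}
              + \lambda^{\,N-1} R_t^{(N)},
\end{align}
where for $n > 1$ the intermediate actions were generated by the behavior policy rather than the current policy, which is the source of the off-policy $n$-step bias discussed above.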