Behavior Proximal Policy Optimization

Offline reinforcement learning (RL) is a challenging setting in which existing off-policy actor-critic methods perform poorly because they overestimate the values of out-of-distribution state-action pairs. Various augmentations have therefore been proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from an analysis of offline monotonic policy improvement, we arrive at a surprising finding: some online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what an offline RL method needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization compared to PPO. Extensive experiments on the D4RL benchmark indicate that this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.
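To make the "inherent conservatism" of PPO-style updates concrete, the following is a minimal sketch of the standard PPO clipped surrogate objective evaluated on samples from a fixed dataset. It is not taken from the BPPO repository. It assumes that the role of PPO's "old policy" is played by an estimate of the behavior policy, and the function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are all illustrative choices.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    logp_new:   log-probabilities of the dataset actions under the policy
                being optimized.
    logp_old:   log-probabilities of the same actions under the reference
                policy (assumed here to be an estimated behavior policy).
    advantages: advantage estimates for the dataset state-action pairs.
    clip_eps:   clipping range; 0.2 is an illustrative default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio between the new policy and the reference policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the objective stops rewarding the new policy once its ratio exceeds `1 + clip_eps`, so there is no incentive to move far from the reference policy on any given sample. This is the sense in which a clipped on-policy update stays close to the policy that generated the data.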
