We introduce PPOPT (Proximal Policy Optimization using Pretraining), a novel, model-free deep-reinforcement-learning algorithm that leverages pretraining to achieve high training efficiency and stability on very small training samples in physics-based environments. Reinforcement learning agents typically rely on large numbers of environment interactions to learn a policy. However, frequent interactions with a (computer-simulated) environment may incur high computational costs, especially when the environment is complex. Our main innovation is a new policy network architecture consisting of a pretrained neural network middle section sandwiched between two fully-connected networks. Pretraining this middle section on a different environment with similar physics helps the agent learn the target environment efficiently, because it can reuse a general understanding of the physics characteristics that transfer from the pretraining environment. We demonstrate that PPOPT outperforms baseline classic PPO on small training samples, both in rewards gained and in overall training stability. While PPOPT underperforms classic model-based methods such as DYNA DDPG, its model-free nature allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at github.com/Davidrxyang/PPOPT.
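
To make the described architecture concrete, below is a minimal sketch in PyTorch of a policy network with a pretrained middle section sandwiched between two fully-connected heads. The class name PPOPTPolicy, the layer widths, and the stand-in pretrained core are illustrative assumptions, not the actual PPOPT implementation from the repository above.

```python
# Minimal sketch, assuming PyTorch. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class PPOPTPolicy(nn.Module):
    """Policy net: input FC head -> pretrained middle section -> output FC head."""

    def __init__(self, obs_dim, act_dim, pretrained_core, core_in=64, core_out=64):
        super().__init__()
        # Fully-connected head adapting the target environment's observation
        # space to the input width the pretrained middle section expects.
        self.input_head = nn.Sequential(nn.Linear(obs_dim, core_in), nn.Tanh())
        # Middle section pretrained on a different environment with similar
        # physics; its weights carry the transferable physics representation.
        self.core = pretrained_core
        # Fully-connected head mapping the core's features to action outputs.
        self.output_head = nn.Sequential(
            nn.Linear(core_out, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )

    def forward(self, obs):
        return self.output_head(self.core(self.input_head(obs)))

# Example usage with a stand-in "pretrained" core (in practice this would be
# loaded from a checkpoint trained on the source environment):
core = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
policy = PPOPTPolicy(obs_dim=11, act_dim=3, pretrained_core=core)
action_mean = policy(torch.randn(1, 11))
```

In this sketch, the input head adapts the target environment's observation dimension to the core's expected input width, which is what allows the same pretrained middle section to be reused across environments with different observation and action spaces.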


