Most reinforcement learning algorithms optimize a discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem under the long-run average criterion. First, we develop a unified trust region theory covering both the discounted and average criteria and derive a novel performance bound within the trust region using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. Finally, we conduct experiments on the continuous control environment MuJoCo. APO outperforms discounted PPO on most tasks, which demonstrates the effectiveness of our approach. Our work provides a unified framework for the trust region approach that includes both the discounted and average criteria, and may help extend reinforcement learning beyond discounted objectives.
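For concreteness, the two criteria contrasted above can be written in their standard form (the notation here is ours and not necessarily the paper's): the discounted objective
$$\eta_\gamma(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big], \quad \gamma \in [0,1),$$
which down-weights rewards far in the future, versus the long-run average objective
$$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T-1} r(s_t, a_t)\Big],$$
which weights all future rewards equally.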