Reinforcement learning (RL) has great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven, relying on millions of iterations and large amounts of empirical data to learn a policy. Although data-driven RL can achieve excellent asymptotic performance, it usually converges slowly. In comparison, model-driven RL employs a differentiable transition model to improve convergence speed, in which the policy gradient (PG) is calculated with the backpropagation through time (BPTT) technique. However, such methods suffer from numerical instability, sensitivity to model error, and low computational efficiency, which may lead to poor policies. In this paper, a mixed policy gradient (MPG) method is proposed, which uses both empirical data and the transition model to construct the PG, so as to accelerate convergence without losing the optimality guarantee. MPG combines two types of PG: 1) the data-driven PG, obtained by directly differentiating the learned Q-value function with respect to actions, and 2) the model-driven PG, calculated via BPTT on the model-predictive return. We unify them by revealing the correlation between the upper bound of the unified PG error and the prediction horizon, where the data-driven PG is regarded as the 0-step model-predictive return. Building on this, MPG employs a rule-based method to adaptively adjust the weights of the data-driven and model-driven PGs. In particular, to obtain a more accurate PG, the weight of the data-driven PG is designed to grow along the learning process while that of the model-driven PG decreases. Besides, an asynchronous learning framework is proposed to reduce the wall-clock time needed for each update iteration. Simulation results show that the MPG method achieves the best asymptotic performance and convergence speed compared with other baseline algorithms.
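To make the mixing idea concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of combining a data-driven PG, taken as the derivative of a learned Q-value through the policy action, with a model-driven PG obtained by BPTT through a learned transition and reward model over a short horizon. The network sizes, the linear weight schedule, and names such as `mpg_update` and `HORIZON` are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a mixed policy gradient (MPG) update, assuming a learned
# critic Q(s, a), transition model f(s, a) -> s', and reward model r(s, a).
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, HORIZON = 4, 2, 5  # HORIZON is an assumed prediction length

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
model  = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.Tanh(), nn.Linear(64, STATE_DIM))
reward = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma = 0.99

def data_driven_loss(states):
    # 0-step return: differentiate the learned Q-value w.r.t. the policy action.
    actions = policy(states)
    return -critic(torch.cat([states, actions], dim=-1)).mean()

def model_driven_loss(states):
    # n-step model-predictive return, differentiated by BPTT through the model.
    ret, s = 0.0, states
    for t in range(HORIZON):
        a = policy(s)
        ret = ret + (gamma ** t) * reward(torch.cat([s, a], dim=-1))
        s = model(torch.cat([s, a], dim=-1))
    a = policy(s)
    ret = ret + (gamma ** HORIZON) * critic(torch.cat([s, a], dim=-1))
    return -ret.mean()

def mpg_update(states, step, total_steps):
    # Rule-based weighting: the data-driven share grows along training while the
    # model-driven share shrinks (a simple linear schedule, assumed here).
    w_data = min(1.0, step / total_steps)
    loss = w_data * data_driven_loss(states) + (1.0 - w_data) * model_driven_loss(states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with a random batch of states standing in for replayed data.
mpg_update(torch.randn(32, STATE_DIM), step=1000, total_steps=100_000)
```

In this sketch only the policy parameters are updated; the critic, transition, and reward networks stand in for models that would be trained separately from empirical data.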