Reinforcement learning (RL) has great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven, relying on millions of iterations and large amounts of empirical data to learn a policy. Although data-driven RL can achieve excellent asymptotic performance, it usually converges slowly. In comparison, model-driven RL employs a differentiable transition model to improve convergence speed, in which the policy gradient (PG) is calculated with the backpropagation through time (BPTT) technique. However, such methods suffer from numerical instability, sensitivity to model error, and low computational efficiency, which may lead to poor policies. In this paper, a mixed policy gradient (MPG) method is proposed, which uses both empirical data and the transition model to construct the PG, so as to accelerate convergence without losing the optimality guarantee. MPG combines two types of PG: 1) the data-driven PG, obtained by directly differentiating the learned Q-value function with respect to actions, and 2) the model-driven PG, calculated via BPTT on the model-predictive return. We unify them by revealing the correlation between the upper bound of the unified PG error and the prediction horizon, where the data-driven PG is regarded as the 0-step model-predictive return. Building on this, MPG employs a rule-based method to adaptively adjust the weights of the data-driven and model-driven PGs. In particular, to obtain a more accurate PG, the weight of the data-driven PG is designed to grow along the learning process while that of the model-driven PG decreases. Besides, an asynchronous learning framework is proposed to reduce the wall-clock time needed for each update iteration. Simulation results show that the MPG method achieves the best asymptotic performance and convergence speed compared with other baseline algorithms.
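To make the mixing idea concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of combining a data-driven PG, taken as the derivative of a learned Q-value through the policy action, with a model-driven PG obtained by BPTT through a learned transition and reward model over a short horizon. The network sizes, the linear weight schedule, and names such as `mpg_update` and `HORIZON` are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a mixed policy gradient (MPG) update, assuming a learned
# critic Q(s, a), transition model f(s, a) -> s', and reward model r(s, a).
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, HORIZON = 4, 2, 5  # HORIZON is an assumed prediction length

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
model  = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.Tanh(), nn.Linear(64, STATE_DIM))
reward = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma = 0.99

def data_driven_loss(states):
    # 0-step return: differentiate the learned Q-value w.r.t. the policy action.
    actions = policy(states)
    return -critic(torch.cat([states, actions], dim=-1)).mean()

def model_driven_loss(states):
    # n-step model-predictive return, differentiated by BPTT through the model.
    ret, s = 0.0, states
    for t in range(HORIZON):
        a = policy(s)
        ret = ret + (gamma ** t) * reward(torch.cat([s, a], dim=-1))
        s = model(torch.cat([s, a], dim=-1))
    a = policy(s)
    ret = ret + (gamma ** HORIZON) * critic(torch.cat([s, a], dim=-1))
    return -ret.mean()

def mpg_update(states, step, total_steps):
    # Rule-based weighting: the data-driven share grows along training while the
    # model-driven share shrinks (a simple linear schedule, assumed here).
    w_data = min(1.0, step / total_steps)
    loss = w_data * data_driven_loss(states) + (1.0 - w_data) * model_driven_loss(states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with a random batch of states standing in for replayed data.
mpg_update(torch.randn(32, STATE_DIM), step=1000, total_steps=100_000)
```

In this sketch only the policy parameters are updated; the critic, transition, and reward networks stand in for models that would be trained separately from empirical data.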