Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i)~preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm \emph{Evidential Proximal Policy Optimization (EPPO)} due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.
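As a concrete illustration of the evidential critic described above, the sketch below shows one plausible parameterization in which the critic outputs Normal--Inverse-Gamma (NIG) parameters over the state value, following the general deep evidential regression recipe. This is a minimal sketch under assumed design choices, not the authors' implementation; the class name \texttt{EvidentialCritic}, the hidden sizes, and the activation choices are all illustrative assumptions.

\begin{verbatim}
# Minimal sketch (illustrative, not the authors' code) of an evidential
# critic head that predicts Normal-Inverse-Gamma (NIG) parameters over
# the state value, so that both a value estimate and an epistemic
# uncertainty are available in closed form.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EvidentialCritic(nn.Module):
    """Maps a state to NIG parameters (gamma, nu, alpha, beta)."""

    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.head = nn.Linear(hidden_dim, 4)  # gamma, nu, alpha, beta

    def forward(self, state: torch.Tensor):
        gamma, log_nu, log_alpha, log_beta = \
            self.head(self.body(state)).chunk(4, dim=-1)
        nu = F.softplus(log_nu)              # evidence for the mean
        alpha = F.softplus(log_alpha) + 1.0  # keep alpha > 1 (finite variance)
        beta = F.softplus(log_beta)
        return gamma, nu, alpha, beta

    @staticmethod
    def value_and_uncertainty(gamma, nu, alpha, beta):
        # Value estimate is the NIG mean; the epistemic variance
        # beta / (nu * (alpha - 1)) shrinks as evidence accumulates and
        # grows again under distributional shift.
        value = gamma
        epistemic_var = beta / (nu * (alpha - 1.0))
        return value, epistemic_var
\end{verbatim}

In such a design, the epistemic variance $\beta / \big(\nu(\alpha - 1)\big)$ plays the two roles stated in the abstract: it flags the distributional shift caused by a change in dynamics (helping preserve the critic's plasticity), and, because the value estimate becomes a random variable, it can be propagated into the actor objective to enable directed exploration. The precise losses and exploration mechanism used by EPPO are defined in the main text.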