Concurrent Credit Assignment for Data-efficient Reinforcement Learning

The capability to sample the state and action spaces widely is a key ingredient in building effective reinforcement learning algorithms. The variational optimization principles presented in this paper emphasize the importance of an occupancy model that synthesizes the general distribution of the environmental states over which the agent can act (defining a virtual "territory"). The occupancy model is updated frequently as exploration progresses and new states are uncovered during the course of training. Under a uniform prior assumption, the resulting objective expresses a balance between two concurrent tendencies, namely the widening of the occupancy space and the maximization of rewards, reminiscent of the classical exploration/exploitation trade-off. Implemented on an off-policy actor-critic algorithm and evaluated on classic continuous-action benchmarks, the approach is shown to provide a significant increase in sampling efficacy, which is reflected in reduced training time and higher returns in both the dense and the sparse reward cases.
