Multi-agent reinforcement learning involves multiple agents interacting with each other and a shared environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of the actions they take, which hinders the learning of policies. In this paper, we propose a method called Shaping Advice in deep Multi-agent reinforcement learning (SAM) to augment the reward signal from the environment with an additional reward termed shaping advice. The shaping advice is given by a difference of potential functions at consecutive time-steps. Each potential function is a function of the observations and actions of the agents. The shaping advice needs to be specified only once at the start of training, and can easily be provided by non-experts. We show through theoretical analyses and experimental validation that shaping advice provided by SAM does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that convergence of policy gradients and value functions when using SAM implies convergence of these quantities in the absence of SAM. Experimentally, we evaluate SAM on three tasks with sparse rewards in the multi-agent Particle World environment. We observe that agents using SAM learn policies to complete tasks faster, and obtain higher rewards, than agents that: i) use sparse rewards alone; ii) use a state-of-the-art reward redistribution method.
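For concreteness, a minimal sketch of the shaping advice in the standard potential-based form, with notation assumed here rather than taken verbatim from the paper: writing $o_t$ and $a_t$ for the joint observation and joint action at time-step $t$, $\Phi$ for a potential function over observations and actions, and $\gamma$ for the discount factor, the shaping advice is
$$F(o_t, a_t, o_{t+1}, a_{t+1}) = \gamma\,\Phi(o_{t+1}, a_{t+1}) - \Phi(o_t, a_t),$$
and each agent is trained on the augmented reward $r_t + F(o_t, a_t, o_{t+1}, a_{t+1})$ rather than the sparse environment reward $r_t$ alone.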