Transfer learning has shown great potential to enhance single-agent Reinforcement Learning (RL) efficiency by sharing learned policies of previous tasks. Similarly, in multiagent settings, learning performance can be improved if agents can share knowledge with each other. However, how an agent should learn from other agents' knowledge remains an open question. In this paper, we propose a novel multiagent option-based policy transfer (MAOPT) framework to improve multiagent learning efficiency. Our framework learns what advice to give to each agent and when to terminate it by modeling multiagent policy transfer as an option learning problem. MAOPT provides several variants, which can be classified into two types according to the experience used during training. The first type, MAOPT with the Global Option Advisor, has access to the global information of the environment. However, in many realistic scenarios, we can obtain only each agent's local information due to partial observability. The second type contains MAOPT with the Local Option Advisor and MAOPT with the Successor Representation Option (SRO), both of which are suitable for this setting and collect each agent's local experience for the update. In many cases, the agents' experiences are mutually inconsistent, which causes the option-value estimation to oscillate and become inaccurate. SRO handles this experience inconsistency by decoupling the dynamics of the environment from the rewards, learning the option-value function under each agent's preference. MAOPT can be easily combined with existing deep RL approaches. Experimental results show that it significantly boosts the performance of existing deep RL methods in both discrete and continuous state spaces.
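The core idea behind SRO can be illustrated with a minimal sketch (not the paper's implementation): a successor representation separates environment dynamics from rewards, so an option value factors as a dot product between expected discounted feature occupancies and agent-specific reward weights. The function names and the tabular setting below are illustrative assumptions.

```python
# Minimal tabular sketch of the successor-representation idea behind SRO:
# Q(s, o) = psi(s, o) . w, where psi captures the dynamics (expected
# discounted feature occupancy) and w holds an agent's reward preferences.
# Names (td_update_sr, option_value) are hypothetical, for illustration only.

def td_update_sr(psi, s, s_next, features, gamma=0.9, alpha=0.5):
    """One TD step on the successor representation for state s."""
    for i in range(len(features[s])):
        target = features[s][i] + gamma * psi[s_next][i]
        psi[s][i] += alpha * (target - psi[s][i])

def option_value(psi, s, w):
    """Value under one agent's reward weights: dot(psi[s], w)."""
    return sum(p * wi for p, wi in zip(psi[s], w))

# Tiny two-state chain (0 -> 1 -> 1) with one-hot state features.
psi = [[0.0, 0.0], [0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(200):
    td_update_sr(psi, 1, 1, features)
    td_update_sr(psi, 0, 1, features)

# The same psi serves agents with different preferences: only w changes.
v_agent_a = option_value(psi, 0, [0.0, 1.0])  # rewards visiting state 1
v_agent_b = option_value(psi, 0, [1.0, 0.0])  # rewards visiting state 0
```

Because the learned `psi` is reward-agnostic, inconsistent per-agent experience affects only the lightweight reward weights `w`, which is the intuition for why the decoupling stabilizes option-value estimation.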