A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games

This paper proposes novel, end-to-end deep reinforcement learning algorithms for two-player zero-sum Markov games. Our objective is to find Nash equilibrium policies, which cannot be exploited by adversarial opponents. Prior work on finding Nash equilibria has focused on extensive-form games such as Poker, which feature tree-structured transition dynamics and a discrete state space. In contrast, this paper focuses on Markov games with general transition dynamics and a continuous state space. We propose (1) the Nash DQN algorithm, which integrates DQN with a Nash-finding subroutine for the joint value functions; and (2) the Nash DQN Exploiter algorithm, which additionally adopts an exploiter to guide the agent's exploration. Our algorithms are practical variants of theoretical algorithms that are guaranteed to converge to Nash equilibria in the basic tabular setting. Experimental evaluation on both tabular examples and two-player Atari games demonstrates the robustness of the proposed algorithms against adversarial opponents, as well as their advantageous performance over existing methods.
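To make the idea of a Nash-finding subroutine for joint value functions concrete, the following is a minimal sketch, not the authors' implementation. It assumes that at a given next state the joint Q-values form a payoff matrix (rows for the agent's actions, columns for the opponent's), approximates the equilibrium of that zero-sum matrix game, and uses the equilibrium value in a DQN-style bootstrap target. The solver is a swapped-in technique chosen for brevity (Hedge, i.e. multiplicative-weights self-play with averaged strategies); the abstract does not say which solver the paper uses. The names `solve_zero_sum`, `exploitability`, and `nash_td_target` are hypothetical.

```python
import math


def _softmax(scores, eta):
    """Return weights proportional to exp(eta * score), computed stably."""
    top = max(scores)
    weights = [math.exp(eta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def solve_zero_sum(payoff, iters=20000):
    """Approximate a Nash equilibrium of a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives
    its negative. Both players run Hedge against each other, and the
    time-averaged strategies are returned with the value they induce.
    Assumption: this solver is illustrative, not the paper's subroutine.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    flat = [v for row in payoff for v in row]
    scale = max(max(flat) - min(flat), 1e-12)  # payoff range
    eta = math.sqrt(8.0 * math.log(max(n_rows, n_cols, 2)) / iters) / scale

    row_gain = [0.0] * n_rows  # cumulative expected payoff per row action
    col_gain = [0.0] * n_cols  # cumulative expected payoff per column action
    row_sum = [0.0] * n_rows   # running sums used to average strategies
    col_sum = [0.0] * n_cols

    for _ in range(iters):
        x = _softmax(row_gain, eta)
        y = _softmax(col_gain, eta)
        for i in range(n_rows):
            row_gain[i] += sum(payoff[i][j] * y[j] for j in range(n_cols))
            row_sum[i] += x[i]
        for j in range(n_cols):
            col_gain[j] -= sum(payoff[i][j] * x[i] for i in range(n_rows))
            col_sum[j] += y[j]

    x_avg = [s / iters for s in row_sum]
    y_avg = [s / iters for s in col_sum]
    value = sum(x_avg[i] * payoff[i][j] * y_avg[j]
                for i in range(n_rows) for j in range(n_cols))
    return x_avg, y_avg, value


def exploitability(payoff, x, y):
    """Duality gap: best-response gain of the row player against y minus
    the payoff the column player's best response holds x to."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = min(sum(x[i] * payoff[i][j] for i in range(n_rows))
                   for j in range(n_cols))
    return best_row - best_col


def nash_td_target(reward, gamma, next_q, done):
    """Bootstrap target that replaces DQN's max over actions with the
    (approximate) Nash value of the next state's joint Q-matrix."""
    if done:
        return reward
    _, _, value = solve_zero_sum(next_q)
    return reward + gamma * value
```

For example, for the payoff matrix `[[2.0, -1.0], [-1.0, 1.0]]` the exact equilibrium has the row player choosing the first action with probability 0.4 and a game value of 0.2; `solve_zero_sum` returns an approximation whose exploitability shrinks as `iters` grows. An exact linear-programming solver would be a natural alternative when one is available.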