Recent works have applied Proximal Policy Optimization (PPO) to multi-agent cooperative tasks, e.g., Independent PPO (IPPO) and vanilla Multi-Agent PPO (MAPPO), which uses a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO and Fine-tuned QMIX on the StarCraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO with carefully designed agent-specific features, at the cost of algorithmic generality. By contrast, we find that MAPPO suffers from \textit{Policy Overfitting in Multi-agent Cooperation (POMAC)}, as the agents learn their policies from sampled shared advantage values. POMAC may then push the multi-agent policy updates in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to address the POMAC problem, we propose two novel policy perturbation methods, i.e., Noisy-Value MAPPO (NV-MAPPO) and Noisy-Advantage MAPPO (NA-MAPPO), which perturb the advantage values with random Gaussian noise. The experimental results show that our methods outperform Fine-tuned QMIX and MAPPO-FP, achieving SOTA performance on SMAC without agent-specific features. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
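To make the perturbation idea concrete, below is a minimal NumPy sketch of an NA-MAPPO-style step: zero-mean Gaussian noise is added to the shared advantage estimates before they are used in the policy update. The function name \texttt{perturb\_advantages} and the noise scale \texttt{sigma} are illustrative assumptions, not the authors' actual API; the official implementation lives in the linked repository.
\begin{verbatim}
import numpy as np

def perturb_advantages(advantages: np.ndarray,
                       sigma: float = 0.1,
                       rng: np.random.Generator | None = None) -> np.ndarray:
    """Add zero-mean Gaussian noise to shared advantage estimates.

    advantages: array of advantage values, e.g. shape (n_agents, batch).
    sigma: noise standard deviation (hypothetical hyperparameter).
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(loc=0.0, scale=sigma, size=advantages.shape)
    return advantages + noise

# Usage sketch: perturb shared advantages before the PPO policy update.
adv = np.random.randn(3, 64)            # 3 agents, batch of 64 samples
noisy_adv = perturb_advantages(adv, sigma=0.1)
\end{verbatim}
The design intent, per the abstract, is that the injected noise decorrelates each agent's policy gradient from the single shared advantage sample, mitigating POMAC and encouraging exploration of better trajectories.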