The performance of multi-agent reinforcement learning (MARL) in partially observable environments depends on effectively aggregating information from observations, communications, and reward signals. While most existing multi-agent systems rely on rewards as the only feedback for policy training, our research shows that introducing auxiliary predictive tasks can significantly enhance learning efficiency and stability. We propose Belief-based Predictive Auxiliary Learning (BEPAL), a framework that incorporates auxiliary training objectives to support policy optimization. BEPAL follows the centralized training with decentralized execution paradigm: each agent learns, alongside its policy model, a belief model that predicts unobservable state information, such as other agents' rewards or motion directions. By enriching hidden state representations with information that does not directly contribute to immediate reward maximization, this auxiliary learning process stabilizes MARL training and improves overall performance. We evaluate BEPAL in a predator-prey environment and Google Research Football, where it improves performance metrics by about 16 percent on average and converges more stably than baseline methods.
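To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of how an auxiliary belief objective can be trained alongside a policy under centralized training with decentralized execution. A recurrent encoder feeds both a policy head and a belief head; the belief head regresses quantities that are unobservable to the agent at execution time (here, other agents' rewards), and its loss is added to a standard policy-gradient loss with a weighting coefficient. All names (`BeliefAgent`, `bepal_loss`, `aux_weight`) and the specific losses are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a BEPAL-style auxiliary objective (assumed design, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BeliefAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, n_other_agents, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRUCell(obs_dim, hidden_dim)            # recurrent hidden state
        self.policy_head = nn.Linear(hidden_dim, n_actions)       # action logits
        self.belief_head = nn.Linear(hidden_dim, n_other_agents)  # predicted rewards of other agents

    def forward(self, obs, h):
        h = self.encoder(obs, h)
        return self.policy_head(h), self.belief_head(h), h


def bepal_loss(logits, actions, advantages, belief_pred, belief_target, aux_weight=0.5):
    # Policy-gradient term (REINFORCE-style, with precomputed advantages).
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantages).mean()
    # Auxiliary belief term: targets are available only to the centralized
    # trainer; decentralized execution never needs them.
    belief_loss = F.mse_loss(belief_pred, belief_target)
    return policy_loss + aux_weight * belief_loss


# Toy usage with random data (batch of 8 agents' observations).
agent = BeliefAgent(obs_dim=10, n_actions=4, n_other_agents=3)
obs, h = torch.randn(8, 10), torch.zeros(8, 64)
logits, belief_pred, h = agent(obs, h)
loss = bepal_loss(logits,
                  torch.randint(0, 4, (8,)),
                  torch.randn(8),
                  belief_pred,
                  torch.randn(8, 3))
loss.backward()
```

The key design choice reflected here is that the belief head shapes the shared hidden representation through its gradient, even though its predictions are never used to select actions at execution time.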