When Is Diversity Rewarded in Cooperative Multi-Agent Learning?

The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents' effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
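As a minimal sketch of the two-level reward structure described above, the following toy example builds a team reward from an inner operator (over agents, per task) and an outer operator (over task scores), then compares the best homogeneous team with the best unrestricted team by brute force. The specific operators (`max`, `mean`, `min`), the setting $N = M = 2$, the grid search, and all function names are illustrative assumptions, not the paper's definitions or its HetGPS algorithm.

```python
from fractions import Fraction
from itertools import product


def team_reward(alloc, inner, outer):
    """alloc[i][j] is the effort agent i puts on task j; each row sums to 1."""
    n_tasks = len(alloc[0])
    # Inner operator: aggregate all agents' efforts on each task into a task score.
    task_scores = [inner([row[j] for row in alloc]) for j in range(n_tasks)]
    # Outer operator: merge the task scores into one global team reward.
    return outer(task_scores)


def mean(values):
    return sum(values) / len(values)


def heterogeneity_gain(inner, outer, steps=20):
    """Best reward over all 2x2 allocations minus best reward when both agents
    use the same allocation (hypothetical toy setting: N = 2 agents, M = 2 tasks)."""
    # Exact rational grid over the effort an agent puts on task 0.
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    best_homogeneous = max(
        team_reward([[x, 1 - x], [x, 1 - x]], inner, outer) for x in grid
    )
    best_heterogeneous = max(
        team_reward([[x, 1 - x], [y, 1 - y]], inner, outer)
        for x, y in product(grid, grid)
    )
    return best_heterogeneous - best_homogeneous


# Assumed example operators: a convex inner operator (max) versus a linear one (mean).
gain_max_min = heterogeneity_gain(inner=max, outer=min)
gain_mean_min = heterogeneity_gain(inner=mean, outer=min)
print(gain_max_min, gain_mean_min)
```

In this toy setting, the `max` inner operator lets two specialists each fully cover one task, so the heterogeneous team scores 1 while any homogeneous team scores at most 1/2. With the linear `mean` inner operator, specialization gives no advantage and the gain is 0. This is consistent with the abstract's claim that operator curvature governs whether heterogeneity can help, but it is only one hand-picked pair of cases.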
