Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks

Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning and incur escalating computational cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64×64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system uses negative log-likelihood (NLL) as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.

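To make the curriculum mechanism in the abstract concrete, the following is a minimal sketch of how a Thompson Sampling manager over spatial zones might look. It is not the authors' implementation. Everything here is an assumption made for illustration: the names `ring_of`, `nll_reward`, and `ThompsonCurriculum`, the use of concentric rings as zones, the Beta posterior with fractional updates, the NLL cap of 3.0, and the 0.8 unlock threshold do not come from the abstract.

```python
import math
import random


def ring_of(row, col, size=64):
    """Hypothetical zone index: Chebyshev distance from the grid centre (0 = centre)."""
    centre = (size - 1) / 2
    return int(max(abs(row - centre), abs(col - centre)))


def nll_reward(correct, p_chosen, nll_cap=3.0):
    """Assumed reward in [0, 1]: 0 for a wrong answer; for a correct answer,
    higher when the probability assigned to it is high (i.e. the NLL is low)."""
    if not correct:
        return 0.0
    nll = -math.log(max(p_chosen, 1e-12))
    return 1.0 - min(nll, nll_cap) / nll_cap


class ThompsonCurriculum:
    """Beta-posterior Thompson Sampling over the currently unlocked zones."""

    def __init__(self, n_zones, seed=0):
        self.alpha = [1.0] * n_zones  # pseudo-counts of reward per zone
        self.beta = [1.0] * n_zones   # pseudo-counts of missing reward per zone
        self.unlocked = 1             # start with only the central zone
        self.rng = random.Random(seed)

    def choose_zone(self):
        # Draw one sample per unlocked zone and train in the zone with the largest draw.
        samples = [self.rng.betavariate(self.alpha[z], self.beta[z])
                   for z in range(self.unlocked)]
        return max(range(self.unlocked), key=samples.__getitem__)

    def update(self, zone, reward, unlock_at=0.8):
        # Fractional Beta update, a common heuristic for rewards in [0, 1].
        self.alpha[zone] += reward
        self.beta[zone] += 1.0 - reward
        mean = self.alpha[zone] / (self.alpha[zone] + self.beta[zone])
        # Expand the operational region once the outermost unlocked zone is mastered.
        is_frontier = zone == self.unlocked - 1
        if is_frontier and mean >= unlock_at and self.unlocked < len(self.alpha):
            self.unlocked += 1


curriculum = ThompsonCurriculum(n_zones=4)
for _ in range(10):
    zone = curriculum.choose_zone()
    curriculum.update(zone, nll_reward(correct=True, p_chosen=0.9))
```

In this sketch, rewarding only predictions that are both correct and confident is one way to favor regions where agents are accurate and well calibrated; a real system would presumably also decide when to fall back to the oracle, which is left out here.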