Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference pass. Multi-agent reasoning systems offer a promising alternative by employing multiple agents, including a Solver, a Verifier, and a Corrector, to iteratively refine solutions. While effective with closed-source models such as Gemini 2.5 Pro, such systems struggle to generalize to open-source models because of their insufficient critique and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to improve efficiency on long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves accuracy on AIME2025 from 86.5% to 93.3% and on BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.