Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making of standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus solely on perception-level tasks, overlooking alignment with downstream planning and control, or fall short of leveraging the full capacity of recently emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further improve the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the bird's-eye-view (BEV) representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate that our approach achieves state-of-the-art (SOTA) performance, with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.
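The abstract describes the MoE-enhanced BEV representations only at a high level. As a rough illustration of the general idea, the PyTorch sketch below applies a token-wise Mixture-of-Experts layer with top-k routing over a flattened BEV grid; the class name `BEVMoE`, the expert count, the top-k gating, and the residual blending are all assumptions made for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVMoE(nn.Module):
    """Hypothetical sketch: token-wise MoE over BEV features.

    Each BEV cell (token) is routed to its top-k experts; expert
    outputs are blended by renormalized gate weights and added back
    to the input as a residual refinement.
    """

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # router over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, H*W, C) -- flattened BEV grid of feature tokens
        logits = self.gate(bev)                           # (B, N, E)
        weights, idx = logits.topk(self.top_k, dim=-1)    # per-token routing
        weights = F.softmax(weights, dim=-1)              # renormalize over top-k
        out = torch.zeros_like(bev)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                             # (B, N, k) tokens routed to expert e
            if mask.any():
                w = (weights * mask).sum(-1, keepdim=True)  # (B, N, 1) gate weight
                out = out + w * expert(bev)
        return bev + out                                  # residual enhancement


# Usage sketch: a batch of 2 BEV maps on a 200x200 grid with 256 channels.
moe = BEVMoE(dim=256)
bev = torch.randn(2, 200 * 200, 256)
enhanced = moe(bev)  # same shape, expert-refined BEV features
```

For clarity this sketch evaluates every expert on all tokens and masks the results, which is correct but wasteful; a practical implementation would gather only the tokens routed to each expert.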