UniUGP: Unifying Understanding, Generation, and Planning for End-to-End Autonomous Driving

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

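The input/output contract described in the abstract can be illustrated with a small interface sketch. This is not the authors' implementation: every name below (`DrivingInput`, `DrivingOutput`, `UnifiedDriver`, the `stub_*` functions) is hypothetical, the experts are trivial placeholders rather than a VLM or a video generation model, and the ordering of the three steps (reason, then plan, then generate) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[List[float]]        # hypothetical stand-in for one camera image
Waypoint = Tuple[float, float]   # assumed (x, y) position in the ego frame


@dataclass
class DrivingInput:
    frames: List[Frame]   # multi-frame observations
    instruction: str      # language instruction


@dataclass
class DrivingOutput:
    reasoning: str               # chain-of-thought text
    trajectory: List[Waypoint]   # planned future waypoints
    future_frames: List[Frame]   # generated future video


class UnifiedDriver:
    """Hypothetical wrapper that routes one input through three experts."""

    def __init__(
        self,
        understand: Callable[[DrivingInput], str],
        plan: Callable[[DrivingInput, str], List[Waypoint]],
        generate: Callable[[DrivingInput, List[Waypoint]], List[Frame]],
    ) -> None:
        self.understand = understand
        self.plan = plan
        self.generate = generate

    def __call__(self, inp: DrivingInput) -> DrivingOutput:
        # Assumed ordering: reason first, then plan, then generate video.
        reasoning = self.understand(inp)
        trajectory = self.plan(inp, reasoning)
        future_frames = self.generate(inp, trajectory)
        return DrivingOutput(reasoning, trajectory, future_frames)


# Placeholder experts so the sketch runs; real experts would be learned models.
def stub_understand(inp: DrivingInput) -> str:
    return f"Instruction received: {inp.instruction}"


def stub_plan(inp: DrivingInput, reasoning: str) -> List[Waypoint]:
    # Straight-line placeholder: three waypoints along the x axis.
    return [(float(i), 0.0) for i in range(1, 4)]


def stub_generate(inp: DrivingInput, trajectory: List[Waypoint]) -> List[Frame]:
    # Placeholder: repeat the last observed frame once per waypoint.
    return [inp.frames[-1] for _ in trajectory]


driver = UnifiedDriver(stub_understand, stub_plan, stub_generate)
result = driver(DrivingInput(frames=[[[0.0]], [[1.0]]], instruction="turn left"))
```

Keeping the three experts behind plain callables means each one could be swapped for a learned model without changing the contract: one call takes frames plus an instruction and returns reasoning text, a trajectory, and future frames together.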