ZTRS: Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring

End-to-end autonomous driving maps raw sensor inputs directly to ego-vehicle trajectories, avoiding cascading errors from perception modules and leveraging rich semantic cues. Existing frameworks largely rely on Imitation Learning (IL), which can be limited by sub-optimal expert demonstrations and by covariate shift during deployment. Reinforcement Learning (RL), on the other hand, has recently shown potential for scaling up with simulation, but it is typically confined to low-dimensional symbolic inputs (e.g., 3D objects and maps) and so falls short of full end-to-end learning from raw sensor data. We introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory Scoring), a framework that combines the strengths of both worlds: raw sensor inputs, which lose no information, and RL training for robust planning. To the best of our knowledge, ZTRS is the first framework that eliminates IL entirely, learning only from rewards while operating directly on high-dimensional sensor data. ZTRS uses offline reinforcement learning with our proposed Exhaustive Policy Optimization (EPO), a policy-gradient variant tailored to enumerable actions and rewards. ZTRS demonstrates strong performance across three benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop planning in challenging real-world and synthetic scenarios), and HUGSIM (simulated closed-loop driving). Specifically, ZTRS achieves state-of-the-art results on Navhard and outperforms IL-based baselines on HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.
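The abstract describes EPO only as a policy-gradient variant for enumerable actions and rewards, so the following is a minimal sketch of one way that idea could look, not the authors' implementation. It assumes a small fixed set of candidate trajectories, a reward already known for every candidate, and a softmax policy over per-candidate scores. Under those assumptions the expected reward can be summed over all actions exactly, so the gradient needs no sampling. All names (`exhaustive_policy_gradient`, `rewards`, `logits`) and the toy numbers are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def exhaustive_policy_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) * r(a) with respect to the logits.

    Assumption: every candidate action has a known reward, so the
    expectation is enumerated in full instead of estimated from samples.
    For a softmax policy, dJ/dlogit_k = pi_k * (r_k - J).
    """
    pi = softmax(logits)
    expected_reward = float(pi @ rewards)
    grad = pi * (rewards - expected_reward)
    return expected_reward, grad


# Toy example: four hypothetical candidate trajectories with made-up rewards.
rewards = np.array([0.1, 0.9, 0.4, 0.0])
logits = np.zeros(4)  # start from a uniform policy

for _ in range(200):
    _, grad = exhaustive_policy_gradient(logits, rewards)
    logits += 1.0 * grad  # plain gradient ascent on the expected reward

best = int(np.argmax(softmax(logits)))  # index of the highest-scored candidate
```

In this toy setting the policy concentrates on the candidate with the highest reward. In a real system the logits would come from a network conditioned on sensor data rather than being free parameters.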
