Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce the new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which targets two tasks: (1) reactive forecasting: "what will happen next"; and (2) proactive forecasting: "what would happen given a specific future action". We construct the first OccSTeP benchmark, featuring challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address these tasks, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments demonstrate the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (a 6.56% gain) and occupancy IoU of 35.89% (a 9.26% gain). The data and code will be open-sourced at https://github.com/FaterYU/OccSTeP.
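To make the recurrent, ego-motion-compensated scene-memory update concrete, the following is a minimal Python sketch of how such a step could look: the previous voxel state is warped into the current ego frame and then blended with the new observation through a learned gate. All module names, tensor shapes, and the gated fusion rule here are illustrative assumptions, not the actual OccSTeP-WM implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentVoxelState(nn.Module):
    """Hypothetical recurrent voxel scene state with ego-motion compensation."""

    def __init__(self, channels: int = 32):
        super().__init__()
        # Gate deciding, per voxel, how much warped memory vs. new
        # observation to keep (a GRU-like convex blend).
        self.gate = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def warp_state(self, state: torch.Tensor, ego_tf: torch.Tensor) -> torch.Tensor:
        """Resample the previous voxel state into the current ego frame.

        state:  (B, C, D, H, W) dense voxel features from time t-1.
        ego_tf: (B, 3, 4) affine transform mapping current-frame voxel
                coordinates back into the previous frame.
        """
        grid = F.affine_grid(ego_tf, list(state.shape), align_corners=False)
        # Voxels that fall outside the previous volume are zero-filled,
        # so the update can fall back to the current observation there.
        return F.grid_sample(state, grid, align_corners=False, padding_mode="zeros")

    def forward(self, prev_state, obs_feat, ego_tf):
        warped = self.warp_state(prev_state, ego_tf)
        z = torch.sigmoid(self.gate(torch.cat([warped, obs_feat], dim=1)))
        # A dropped or noisy frame can be handled by zeroing obs_feat,
        # in which case the state degrades gracefully toward pure memory.
        return z * warped + (1.0 - z) * obs_feat

# Usage: one recurrent step over a toy (B=1, C=32, 64^3) voxel grid.
model = RecurrentVoxelState(32)
state = torch.zeros(1, 32, 64, 64, 64)
obs = torch.randn(1, 32, 64, 64, 64)
identity = torch.eye(3, 4).unsqueeze(0)  # no ego motion in this toy step
state = model(state, obs, identity)

Because the blend is a per-voxel convex combination, this kind of update keeps the state bounded over long horizons, which is consistent with the abstract's claim of robust online inference under missing or noisy history.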