Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection- or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We therefore introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, their predictions do not work out of the box: the recovered geometry is sparse, noisy, and inconsistent, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content, and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows essentially any state-of-the-art occupancy architecture to be trained without LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
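To make the label-generation idea concrete, the following is a minimal, illustrative sketch (not the paper's actual pipeline) of one way filtered, ego-motion-aligned static points with per-point semantic predictions could be accumulated into a semantic voxel label grid by majority vote. The function name, confidence threshold, and helper structure are assumptions for illustration; the grid extent and 0.4 m voxel size merely mirror the common Occ3D-nuScenes setting.

```python
import numpy as np
from collections import defaultdict


def accumulate_static_points(frames, voxel_size=0.4,
                             grid_range=(-40.0, 40.0, -40.0, 40.0, -1.0, 5.4)):
    """Illustrative sketch: fuse per-frame static 3D points (already aligned to a
    common world frame) into a semantic voxel grid via per-voxel majority vote.

    frames: iterable of (points [N,3], labels [N], confidences [N]) NumPy arrays.
    Returns an int32 grid where -1 marks free/unobserved voxels.
    """
    x_min, x_max, y_min, y_max, z_min, z_max = grid_range
    dims = (
        int(round((x_max - x_min) / voxel_size)),
        int(round((y_max - y_min) / voxel_size)),
        int(round((z_max - z_min) / voxel_size)),
    )
    # votes[(i, j, k)][class_id] -> number of points of that class in the voxel
    votes = defaultdict(lambda: defaultdict(int))

    for points, labels, confidences in frames:
        # Simple filtering step (assumed): keep only confident points inside the grid.
        keep = (
            (confidences > 0.5)
            & (points[:, 0] >= x_min) & (points[:, 0] < x_max)
            & (points[:, 1] >= y_min) & (points[:, 1] < y_max)
            & (points[:, 2] >= z_min) & (points[:, 2] < z_max)
        )
        pts, lbl = points[keep], labels[keep]
        # Map points to integer voxel indices.
        idx = np.floor((pts - np.array([x_min, y_min, z_min])) / voxel_size).astype(int)
        for (i, j, k), c in zip(idx, lbl):
            votes[(i, j, k)][int(c)] += 1

    # Resolve each occupied voxel to its majority semantic class.
    grid = np.full(dims, -1, dtype=np.int32)
    for (i, j, k), class_counts in votes.items():
        grid[i, j, k] = max(class_counts, key=class_counts.get)
    return grid
```

A voting-based fusion like this is only one plausible way to obtain a "stable voxel representation" from multi-frame geometry; the actual ShelfOcc framework additionally handles dynamic content, which this static-only sketch omits.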