Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/