Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data simply by training across many scenes, without requiring pre-trained supervised models or foundational priors. Our approach reconstructs dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show that Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
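To make the stated objective concrete, the sketch below illustrates one plausible form of a training loss combining a photometric reconstruction term with an "as static as possible" penalty on predicted per-Gaussian motion. The function name, the L1/L2 choices, and the weight lambda_static are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def flux4d_style_loss(rendered, target, per_gaussian_flow, lambda_static=0.01):
    """Hypothetical sketch of the objective described in the abstract:
    photometric reconstruction plus an 'as static as possible' regularizer
    that discourages per-Gaussian motion unless it is needed to explain
    the observations. Details are assumed, not taken from the paper."""
    # Photometric loss between the rendered and observed images (L1 here;
    # the abstract only states that photometric losses are used).
    photometric = F.l1_loss(rendered, target)
    # "As static as possible" regularization: penalize the magnitude of the
    # predicted per-Gaussian motion so that static structure explains most of
    # the scene and only genuinely dynamic elements receive non-zero flow.
    static_reg = per_gaussian_flow.norm(dim=-1).mean()
    return photometric + lambda_static * static_reg

# Toy usage with placeholder tensors standing in for a differentiable
# renderer's output, the sensor observation, and per-Gaussian motion vectors.
rendered = torch.rand(3, 64, 64, requires_grad=True)
target = torch.rand(3, 64, 64)
flow = torch.zeros(1000, 3, requires_grad=True)
loss = flux4d_style_loss(rendered, target, flow)
loss.backward()
```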