Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on \emph{sparse-input novel view synthesis}: not only filling spatial gaps between widely spaced views, but also \emph{completing a natural video} unfolding through space. We recast the task as \emph{test-time natural video completion}, using powerful priors from \emph{pretrained video diffusion models} to hallucinate plausible in-between views. Our \emph{zero-shot, generation-guided} framework produces pseudo views at novel camera poses, modulated by an \emph{uncertainty-aware mechanism} for spatial coherence. These synthesized frames densify the supervision available to \emph{3D Gaussian Splatting} (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the reconstructed scene and the generated views. The result is coherent, high-fidelity renderings from sparse inputs \emph{without any scene-specific training or fine-tuning}. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.