Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, which causes spatiotemporal inconsistencies and poor generalization. To address these issues, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, which addresses the scarcity of high-quality 4D scene data. Building on this dataset, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) that jointly generates geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module that integrate geometry and dynamics effectively. We then propose a 4D View Synthesis Module (4D-ViSM) that renders videos along arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.
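
To make the 4D point track representation concrete, the following minimal sketch shows how a set of 3D point trajectories can be projected into the frames of an arbitrary camera trajectory. It is not the 4D-ViSM module, which the abstract describes only at a high level; it illustrates just the standard pinhole projection that any rendering from point tracks would build on. The function name `project_tracks`, the array layout `(T, N, 3)`, and the toy intrinsics are assumptions made for illustration.

```python
import numpy as np

def project_tracks(tracks, K, world_to_cam):
    """Project 4D point tracks into per-frame pixel coordinates.

    tracks:       (T, N, 3) world-space positions of N points over T frames
    K:            (3, 3) pinhole intrinsics (assumed shared by all frames)
    world_to_cam: (T, 4, 4) per-frame extrinsics, i.e. the camera trajectory
    Returns (T, N, 2) pixel coordinates and (T, N) camera-space depths.
    Assumes every point lies in front of the camera (depth > 0).
    """
    T, N, _ = tracks.shape
    # Homogeneous coordinates so the 4x4 extrinsics apply in one step.
    homog = np.concatenate([tracks, np.ones((T, N, 1))], axis=-1)
    # Transform each frame's points with that frame's camera pose.
    cam = np.einsum("tij,tnj->tni", world_to_cam, homog)[..., :3]
    depth = cam[..., 2]
    # Apply intrinsics, then divide by depth (perspective projection).
    pix = np.einsum("ij,tnj->tni", K, cam)
    uv = pix[..., :2] / pix[..., 2:3]
    return uv, depth

# Toy example: two frames, two points.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
tracks = np.array([
    [[0.0, 0.0, 2.0], [1.0, 0.0, 4.0]],   # frame 0
    [[0.2, 0.0, 2.0], [1.0, 0.0, 4.0]],   # frame 1: first point moves in x
])
world_to_cam = np.stack([np.eye(4), np.eye(4)])
world_to_cam[1, 0, 3] = -0.2              # camera shifts +0.2 in x at frame 1

uv, depth = project_tracks(tracks, K, world_to_cam)
```

In this toy setup the first point moves together with the camera and so stays at the image center, while the static second point shifts in the image because of the camera motion alone. Separating these two sources of apparent motion is what a representation of 3D trajectories plus an explicit camera path makes possible.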
