We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-view images using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
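For concreteness, body-driven deformation of UV-anchored Gaussians can be realized with standard linear blend skinning (LBS) on the parametric body model. The sketch below is a minimal illustration under that assumption, not the authors' implementation; all names (`animate_uv_gaussians`, `uv_offsets`, `skin_weights`, and so on) are hypothetical.

```python
# Minimal sketch: posing UV-anchored Gaussian centers via linear blend
# skinning (LBS). Assumes one Gaussian per UV texel of a parametric body
# model (e.g. an SMPL-family mesh with J joints); names are illustrative.
import torch

def animate_uv_gaussians(uv_offsets, rest_positions, skin_weights, joint_transforms):
    """
    uv_offsets:       (N, 3) per-Gaussian offsets decoded from the UV-space latent
    rest_positions:   (N, 3) canonical body-surface points at UV texel centers
    skin_weights:     (N, J) skinning weights interpolated from the body mesh
    joint_transforms: (J, 4, 4) rigid transforms of the posed skeleton
    Returns posed 3D Gaussian centers, (N, 3).
    """
    # Canonical Gaussian center: body-surface point plus decoded offset.
    x = rest_positions + uv_offsets                                   # (N, 3)
    x_h = torch.cat([x, torch.ones_like(x[:, :1])], dim=-1)           # (N, 4)
    # Blend per-joint rigid transforms with the skinning weights (LBS).
    T = torch.einsum("nj,jab->nab", skin_weights, joint_transforms)   # (N, 4, 4)
    posed = torch.einsum("nab,nb->na", T, x_h)                        # (N, 4)
    return posed[:, :3]

if __name__ == "__main__":
    N, J = 4096, 24  # 24 joints, as in SMPL-family body models (assumption)
    centers = animate_uv_gaussians(
        torch.zeros(N, 3),                          # zero offsets: stay on the surface
        torch.rand(N, 3),                           # dummy canonical surface points
        torch.softmax(torch.rand(N, J), dim=-1),    # weights sum to 1 per Gaussian
        torch.eye(4).expand(J, 4, 4),               # identity pose: centers unchanged
    )
    print(centers.shape)  # torch.Size([4096, 3])
```

Because each Gaussian is tied to a fixed UV texel, its identity is stable across poses: the same texel always deforms with the same body-surface point, which is what lets the representation stay consistent during animation while preserving fine detail.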