We present Animus3D, a text-driven 3D animation framework that generates a motion field for a static 3D asset from a text prompt. Previous methods mostly apply the vanilla Score Distillation Sampling (SDS) objective to distill motion from a pretrained text-to-video diffusion model, yielding animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, MSD employs a LoRA-enhanced video diffusion model that defines a static source distribution in place of the pure noise used in SDS, while an inversion-based noise estimation technique preserves appearance as motion is guided. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module that upscales the temporal resolution and enhances fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at https://qiisun.github.io/animus3d_page.
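For orientation, the vanilla SDS gradient referenced above (following DreamFusion) can be written as

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl( \epsilon_\phi(x_t;\, y,\, t) - \epsilon \bigr)\, \frac{\partial x}{\partial \theta} \right],
\]

where \(x\) is the rendered video with animation parameters \(\theta\), \(y\) the text prompt, \(\epsilon_\phi\) the pretrained video diffusion model, and \(w(t)\) a timestep weighting. A minimal sketch of MSD under one plausible reading of the description above, in which the LoRA-enhanced model \(\epsilon_{\phi'}\) modeling the static source distribution replaces the pure-noise term \(\epsilon\), is

\[
\nabla_\theta \mathcal{L}_{\mathrm{MSD}} \approx \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl( \epsilon_\phi(x_t;\, y,\, t) - \epsilon_{\phi'}(x_t;\, y,\, t) \bigr)\, \frac{\partial x}{\partial \theta} \right].
\]

The notation \(\epsilon_{\phi'}\) and this exact form are assumptions for illustration, not the paper's stated objective; the intuition is that differencing against a static-source score cancels the appearance component of the guidance, so only the motion component is distilled.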