We present a framework for generating music-synchronized, choreography-aware animal dance videos. Our framework introduces choreography patterns -- structured sequences of motion beats that define the long-range structure of a dance -- as a novel high-level control signal for dance video generation. These patterns can be automatically estimated from human dance videos. Starting from a few keyframes depicting distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem that seeks the optimal keyframe sequence satisfying a specified choreography pattern of beats. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce dance videos of up to 30 seconds across a wide range of animals and music tracks.
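To make the graph-optimization view concrete, the following is a minimal toy sketch (not the paper's actual formulation): keyframes are graph nodes, edge weights stand in for a transition cost between poses, and a dynamic program picks the cheapest keyframe sequence with one keyframe per motion beat. All names, costs, and the no-repeat constraint are illustrative assumptions.

```python
# Hypothetical toy setup: pose names, costs, and constraints are
# illustrative, not the paper's actual cost terms or pattern encoding.
keyframes = ["A", "B", "C", "mirror_B"]  # distinct poses (incl. a mirrored one)
transition_cost = {                      # proxy for visual smoothness
    ("A", "B"): 1.0, ("B", "A"): 1.0,
    ("B", "C"): 2.0, ("C", "B"): 2.0,
    ("A", "C"): 3.0, ("C", "A"): 3.0,
    ("B", "mirror_B"): 0.5, ("mirror_B", "B"): 0.5,
    ("A", "mirror_B"): 1.0, ("mirror_B", "A"): 1.0,
    ("C", "mirror_B"): 2.0, ("mirror_B", "C"): 2.0,
}

def best_sequence(keyframes, cost, n_beats):
    """Dynamic program: cheapest keyframe sequence spanning n_beats beats,
    where consecutive keyframes must differ (a beat implies motion)."""
    # dp[k] = (cost of best sequence ending at keyframe k, that sequence)
    dp = {k: (0.0, [k]) for k in keyframes}
    for _ in range(n_beats - 1):
        nxt = {}
        for k in keyframes:
            candidates = [(dp[p][0] + cost[(p, k)], dp[p][1] + [k])
                          for p in keyframes if p != k]
            nxt[k] = min(candidates)  # extend the cheapest predecessor
        dp = nxt
    return min(dp.values())

total, path = best_sequence(keyframes, transition_cost, n_beats=4)
# A four-beat pattern favors alternating between a pose and its mirror,
# since that transition is cheapest in this toy cost table.
```

In this toy instance the optimum alternates `B` and `mirror_B`, which also illustrates why mirrored pose generation matters: mirrored pairs make cheap, symmetric beat-to-beat transitions available to the optimizer.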