Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis

Transferring human motion from a source to a target person holds great potential in computer vision and graphics applications. A crucial step is to manipulate sequential future motion while retaining the appearance characteristics. Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person, which is not scalable in practice. This work studies a more general setting, in which we aim to learn a single model, named Collaborative Parsing-Flow Network (CPF-Net), to parsimoniously transfer motion from a source video to any target person given only one image of that person. The paucity of information about the target person makes it particularly challenging to faithfully preserve the appearance in varying designated poses. To address this issue, CPF-Net integrates structured human parsing and appearance flow to guide realistic foreground synthesis, which is merged into the background by a spatio-temporal fusion module. In particular, CPF-Net decouples the problem into three stages: human parsing sequence generation, foreground sequence generation, and final video generation. The human parsing generation stage captures both the pose and the body structure of the target. The appearance flow helps preserve details in the synthesized frames. The integration of human parsing and appearance flow effectively guides the generation of video frames with realistic appearance. Finally, the dedicatedly designed fusion network ensures temporal coherence. We further collect a large set of human dancing videos to push forward this research field. Both quantitative and qualitative results show that our method substantially improves over previous approaches and is able to generate appealing and photo-realistic target videos given any input person image. All source code and the dataset will be released at https://github.com/xiezhy6/CPF-Net.
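
The two ideas named above, warping appearance with a flow field and merging a foreground into a background, can be illustrated with a toy sketch. This is not the CPF-Net implementation: the function names `warp_with_flow` and `fuse`, the nearest-neighbour sampling, the per-pixel soft mask, and the grayscale nested-list image format are all assumptions made for illustration. A real system would predict the flow and mask with neural networks and operate on tensors.

```python
from typing import List

# Assumption: a grayscale image is a list of rows of float intensities.
Image = List[List[float]]


def warp_with_flow(source: Image, flow_x: Image, flow_y: Image) -> Image:
    """Backward-warp `source` with a dense flow field (hypothetical helper).

    Each output pixel (x, y) copies the source pixel at
    (x + flow_x[y][x], y + flow_y[y][x]), rounded to the nearest pixel
    and clamped to the image border.
    """
    height, width = len(source), len(source[0])
    warped: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            src_x = min(max(int(round(x + flow_x[y][x])), 0), width - 1)
            src_y = min(max(int(round(y + flow_y[y][x])), 0), height - 1)
            row.append(source[src_y][src_x])
        warped.append(row)
    return warped


def fuse(foreground: Image, background: Image, mask: Image) -> Image:
    """Blend per pixel: mask * foreground + (1 - mask) * background.

    The soft mask is an assumption of this sketch; values lie in [0, 1].
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


# Toy usage: shift appearance one pixel to the left, then composite.
target_appearance = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flow_x = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # sample one pixel to the right
flow_y = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # no vertical motion
foreground = warp_with_flow(target_appearance, flow_x, flow_y)

background = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[1.0, 1.0, 0.0], [0.0, 0.5, 1.0]]
frame = fuse(foreground, background, mask)
```

Backward warping (each output pixel looks up where to sample from) is used here because it leaves no holes in the output, which is why it is a common choice for flow-based appearance transfer.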
