Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely underexplored. Most prior work has focused on mapping modalities such as speech, generic audio, and music to human motion. To date, these models typically overlook the influence of the spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse, high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive benchmarking experiments, in which our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation.
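To make the described setup concrete, the following is a minimal sketch of a diffusion-style motion denoiser conditioned on spatial audio features through a simple fusion step. It is not the MOSPA implementation (see the repository above for that); the module names, feature dimensions, and the additive token-fusion scheme here are all illustrative assumptions.

```python
# Minimal illustrative sketch (NOT the authors' MOSPA code): a transformer
# denoiser that predicts clean motion from noisy motion, conditioned on
# per-frame spatial audio features and a diffusion timestep. All dimensions
# and the additive fusion are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatialAudioMotionDenoiser(nn.Module):
    def __init__(self, motion_dim=263, audio_dim=128, latent_dim=256,
                 n_layers=4, n_heads=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, latent_dim)  # motion frames -> tokens
        self.audio_proj = nn.Linear(audio_dim, latent_dim)    # spatial audio feats -> tokens
        self.time_embed = nn.Sequential(                      # diffusion timestep embedding
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(latent_dim, motion_dim)          # predict the clean motion

    def forward(self, noisy_motion, audio_feats, t):
        # noisy_motion: (B, T, motion_dim); audio_feats: (B, T, audio_dim); t: (B,)
        m = self.motion_proj(noisy_motion)
        a = self.audio_proj(audio_feats)
        te = self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # (B, 1, latent)
        # Fusion (illustrative): add per-frame audio tokens to motion tokens
        # and broadcast the timestep embedding across all frames.
        h = self.backbone(m + a + te)
        return self.out(h)


if __name__ == "__main__":
    model = SpatialAudioMotionDenoiser()
    x_t = torch.randn(2, 60, 263)      # noisy motion, 60 frames
    audio = torch.randn(2, 60, 128)    # per-frame spatial audio features
    t = torch.randint(0, 1000, (2,))   # diffusion timesteps
    print(model(x_t, audio, t).shape)  # -> torch.Size([2, 60, 263])
```

In a full diffusion pipeline, this denoiser would be trained with a standard denoising objective and sampled iteratively; the hypothetical additive fusion could equally be replaced by concatenation or cross-attention.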