MOSPA: Human Motion Generation Driven by Spatial Audio

Enabling virtual humans to respond dynamically and realistically to diverse auditory stimuli remains a key challenge in character animation, one that demands the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most prior work has focused on generating human motion from modalities such as speech, audio, and music. To date, these models typically overlook how the spatial features encoded in spatial audio signals affect human motion. To bridge this gap and enable high-quality modeling of human movement in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse, high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We thoroughly investigate the proposed dataset and conduct extensive benchmarking experiments, in which our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation

