Reliable co-speech motion generation requires a precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically along the skeleton: each joint's global orientation depends on the entire chain of its ancestors. Errors therefore accumulate during generation, manifesting as unstable and implausible motion at the end-effectors. In this work, we propose GlobalDiff, the first diffusion-based framework that operates directly in the space of global joint rotations, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. Because the global rotation space lacks the structural priors built into the local parameterization, we introduce a multi-level constraint scheme to compensate. Specifically, a joint structure constraint places virtual anchor points around each joint to capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to preserve structural integrity. A temporal structure constraint uses a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. Together, these constraints regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motion, improving performance by 46.0% over the current state of the art across multiple speaker identities.
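The error-accumulation argument above can be made concrete with a toy experiment. The sketch below (not the authors' code; the five-joint chain, planar z-axis rotations, and the fixed per-joint error are illustrative assumptions) contrasts the two parameterizations: with local rotations, global orientations are products over the ancestor chain, so a small per-joint error compounds toward the end-effector, whereas directly predicted global rotations keep the same error bounded per joint.

```python
# Toy sketch contrasting local (hierarchical) vs. global rotation
# parameterizations. All quantities here are illustrative assumptions.
import numpy as np

def rot_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

n_joints = 5                  # chain: e.g. shoulder -> ... -> fingertip
eps = 0.05                    # assumed per-joint prediction error (radians)

def forward_kinematics(local_rots):
    """Accumulate local rotations down the chain into global rotations."""
    globals_, G = [], np.eye(3)
    for R in local_rots:
        G = G @ R             # each joint inherits all upstream rotations
        globals_.append(G)
    return globals_

# Ground truth: every joint rotates 0.3 rad relative to its parent.
G_true = forward_kinematics([rot_z(0.3)] * n_joints)

# Local parameterization: the same small error at every joint compounds,
# so the end-effector error grows with chain depth.
G_local_pred = forward_kinematics([rot_z(0.3 + eps)] * n_joints)

# Global parameterization: each joint's orientation is predicted directly,
# so the error stays local instead of accumulating along the chain.
G_global_pred = [rot_z(0.3 * (i + 1) + eps) for i in range(n_joints)]

def geodesic_deg(Ra, Rb):
    """Geodesic angle between two rotations, in degrees."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for i in range(n_joints):
    print(f"joint {i}: local-chain error = "
          f"{geodesic_deg(G_true[i], G_local_pred[i]):5.2f} deg, "
          f"global error = {geodesic_deg(G_true[i], G_global_pred[i]):5.2f} deg")
```

Running this prints a local-chain error that grows linearly with joint depth (0.05 rad per level, about 14.3 degrees at the fifth joint) against a constant 2.9-degree error in the global parameterization, which is the instability at end-effectors that the abstract attributes to hierarchical local rotations.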