With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have shown great promise, yet they face growing complexity as the number of modalities and tasks increases. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space with a variational autoencoder (VAE), avoiding quantization-induced artifacts while leveraging the semantic prior of a pretrained language model. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizes optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a three-stage generate-then-align schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 converges 2x faster in training loss and up to 4x faster in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
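The dual-stream design with shared attention can be illustrated schematically: each modality keeps its own projection parameters (modality-specific routes), while a single attention operation runs over the concatenated token sequence so the two streams exchange information. The following is a minimal NumPy sketch under those assumptions, not the paper's implementation; all dimensions, parameter names, and the single-head setup are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                     # model width (illustrative)
T_text, T_motion = 5, 7   # token counts per stream (illustrative)

def proj():
    return rng.standard_normal((d, d)) / np.sqrt(d)

# Modality-specific parameters: each stream has its own Q/K/V projections.
Wq_t, Wk_t, Wv_t = proj(), proj(), proj()   # text branch
Wq_m, Wk_m, Wv_m = proj(), proj(), proj()   # motion branch

def shared_attention(text_tokens, motion_tokens):
    """Per-stream projections, then one joint attention over both streams.

    The joint attention over the concatenated sequence is the "shared"
    step that lets information flow bidirectionally between modalities;
    the outputs are split back into modality-specific routes.
    """
    q = np.concatenate([text_tokens @ Wq_t, motion_tokens @ Wq_m])
    k = np.concatenate([text_tokens @ Wk_t, motion_tokens @ Wk_m])
    v = np.concatenate([text_tokens @ Wv_t, motion_tokens @ Wv_m])
    attn = softmax(q @ k.T / np.sqrt(d))
    out = attn @ v
    return out[: len(text_tokens)], out[len(text_tokens):]

text = rng.standard_normal((T_text, d))
motion = rng.standard_normal((T_motion, d))
text_out, motion_out = shared_attention(text, motion)
```

In a full model each branch would also have its own feed-forward layers and normalization, with motion tokens living in the continuous VAE latent space rather than a discrete codebook.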