3D human motion prediction aims to generate coherent future motions from observed sequences, yet existing end-to-end regression frameworks often fail to capture complex dynamics and tend to produce temporally inconsistent or static predictions. This limitation is rooted in representation shortcutting, where models rely on superficial cues rather than learning meaningful motion structure. We propose a two-stage self-supervised framework that decouples representation learning from prediction. In the pretraining stage, the model performs unified past-future self-reconstruction: it reconstructs the past sequence while recovering masked joints in the future sequence under full historical guidance. A velocity-based masking strategy selects highly dynamic joints, forcing the model to focus on informative motion components and to internalize the statistical dependencies between past and future states without regression interference. In the fine-tuning stage, the pretrained model predicts the entire future sequence, now treated as fully masked, and is further equipped with a lightweight future-text prediction head for joint optimization of low-level motion prediction and high-level motion understanding. Experiments on Human3.6M, 3DPW, and AMASS show that our method reduces average prediction error by 8.8% over state-of-the-art methods while achieving future-motion understanding performance competitive with LLM-based models. Code is available at: https://github.com/JunyuShi02/MoReFun
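A minimal sketch of the velocity-based masking idea described above, assuming the future clip is a (T, J, 3) tensor of 3D joint positions; the `mask_ratio` hyperparameter and top-k selection rule are illustrative assumptions, not the exact procedure used in MoReFun.

```python
import torch


def velocity_based_mask(future: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Select the most dynamic joints of a future motion clip for masking.

    Illustrative sketch: `future` is assumed to have shape (T, J, 3), and
    `mask_ratio` is a hypothetical hyperparameter.
    """
    # Per-joint dynamism: frame-to-frame displacement magnitude, summed over time.
    velocity = future[1:] - future[:-1]          # (T-1, J, 3)
    dynamism = velocity.norm(dim=-1).sum(dim=0)  # (J,) total motion per joint

    # Mask the top-k most dynamic joints, so that reconstructing them forces
    # the model to capture genuine motion rather than copy near-static poses.
    k = max(1, int(mask_ratio * future.shape[1]))
    masked_joints = dynamism.topk(k).indices     # (k,) indices of dynamic joints

    mask = torch.zeros(future.shape[1], dtype=torch.bool)
    mask[masked_joints] = True
    return mask                                  # True = joint is masked
```

During pretraining such a mask would hide only the selected joints of the future sequence; at fine-tuning time, per the description above, the entire future is treated as masked (equivalent to `mask_ratio = 1.0`).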