Lifting-based methods have dominated monocular 3D human pose estimation by using detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas the depth component must be estimated from scratch. Lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, which explicitly injects depth uncertainty into the detected 2D pose and thereby limits overall estimation accuracy. This work shows that the depth representation is pivotal to the estimation process. Specifically, when depth is in its initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental. In contrast, once depth has been refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present PoseMoE, a Mixture-of-Experts network for monocular 3D pose estimation. Our approach introduces two components. (1) A mixture-of-experts network uses specialized expert modules to refine the well-detected 2D pose features and to learn the depth features. This design disentangles the feature encoding of 2D pose and depth, thereby reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module aggregates cross-expert spatio-temporal contextual information, enhancing features through bidirectional mapping between 2D pose and depth. Extensive experiments show that PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
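The two-stage idea described above (encode 2D pose and depth separately, then let the two streams exchange information in both directions) can be illustrated with a minimal NumPy sketch. This is not the PoseMoE implementation: the single-layer experts, the per-joint learnable depth tokens, the use of dot-product attention for the bidirectional mapping, the restriction to one frame (no temporal axis), and all names and sizes (`expert`, `cross_attend`, `J`, `D`, the random untrained weights) are assumptions chosen only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
J, D = 17, 32  # number of joints and feature width (both hypothetical)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def expert(x, w, b):
    """One expert: a single linear layer with ReLU (illustrative stand-in)."""
    return np.maximum(x @ w + b, 0.0)


def cross_attend(query, context):
    """Scaled dot-product attention: query tokens read from context tokens."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ context


# Untrained random weights; a real model would learn all of these.
w_in = rng.normal(scale=0.1, size=(2, D))          # 2D joint -> pose feature
depth_tokens = rng.normal(scale=0.1, size=(J, D))  # assumed learnable depth tokens
w_pose, b_pose = rng.normal(scale=0.1, size=(D, D)), np.zeros(D)
w_depth, b_depth = rng.normal(scale=0.1, size=(D, D)), np.zeros(D)
w_xy = rng.normal(scale=0.1, size=(D, 2))          # pose stream -> (x, y)
w_z = rng.normal(scale=0.1, size=(D, 1))           # depth stream -> z


def forward(pose_2d):
    """Map detected 2D joints of shape (J, 2) to a 3D pose of shape (J, 3)."""
    # Stage 1: disentangled encoding. Each expert sees only its own stream,
    # so the still-unreliable depth features cannot perturb the 2D features.
    pose_feat = expert(pose_2d @ w_in, w_pose, b_pose)
    depth_feat = expert(depth_tokens, w_depth, b_depth)

    # Stage 2: bidirectional exchange once depth has been through its expert.
    pose_out = pose_feat + cross_attend(pose_feat, depth_feat)
    depth_out = depth_feat + cross_attend(depth_feat, pose_feat)

    # Separate heads produce the 2D part and the depth part of the 3D pose.
    return np.concatenate([pose_out @ w_xy, depth_out @ w_z], axis=1)


pose_2d = rng.normal(size=(J, 2))  # stand-in for a 2D detector's output
pose_3d = forward(pose_2d)
print(pose_3d.shape)
```

The residual form in stage 2 keeps each stream's own features and adds what it reads from the other stream, which is one simple way to realize a two-way mapping; the paper's aggregation module may differ.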