Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome this barrier, we propose EMHI, a multimodal \textbf{E}gocentric human \textbf{M}otion dataset with \textbf{H}ead-Mounted Display (HMD) and body-worn \textbf{I}MUs, with all data collected using a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-tilted cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. The dataset consists of 885 sequences captured from 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads. Experiments on EMHI show that MEPoser outperforms existing single-modal methods, demonstrating the value of our dataset for egocentric HPE. We believe the release of EMHI and the method could advance research on egocentric HPE and expedite the practical adoption of this technology in VR/AR products.