Towards Open-World Human Action Segmentation Using Graph Convolutional Networks

Human-object interaction segmentation is a fundamental task in daily activity understanding and plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. Although most existing learning-based methods excel at closed-world action segmentation, they struggle to generalize to open-world scenarios in which novel actions emerge. Collecting exhaustive action categories for training is impractical given the dynamic diversity of human activities, so models are needed that can detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling; 2) Mixup-based training that synthesizes out-of-distribution data, eliminating reliance on manual annotations; and 3) a novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O). Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, with relative gains of 16.9% in open-set segmentation (F1@50) and 34.6% in out-of-distribution detection (AUROC). Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component and identify the optimal framework configuration for open-world action segmentation.
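
The abstract does not specify how Mixup is used to synthesize out-of-distribution data, so the following is only a minimal sketch of the generic mixup operation applied to feature sequences, not the authors' implementation. The function name `mixup_pseudo_ood`, the `alpha` parameter, the `(N, T, D)` feature layout, and the use of one label per sequence are all assumptions made for illustration. The idea shown is that a convex blend of two sequences from different known classes can serve as a stand-in for an unseen action, without any manual annotation.

```python
import numpy as np


def mixup_pseudo_ood(features, labels, alpha=1.0, rng=None):
    """Blend pairs of sequences; flag blends of different classes as pseudo-OOD.

    features: array of shape (N, T, D), N sequences of T frames with D features.
    labels:   array of shape (N,), one class id per sequence (a simplification;
              real action segmentation uses frame-wise labels).
    alpha:    Beta distribution parameter for the mixing coefficient (assumed).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]

    # Pair every sequence with a randomly chosen partner.
    perm = rng.permutation(n)

    # Standard mixup draws the mixing coefficient from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha, size=n)
    lam_b = lam[:, None, None]  # broadcast over the time and feature axes

    # Convex combination of each sequence with its partner.
    mixed = lam_b * features + (1.0 - lam_b) * features[perm]

    # Hypothetical labeling rule: a blend of two different known classes
    # is treated as a synthetic out-of-distribution sample.
    is_pseudo_ood = labels != labels[perm]
    return mixed, is_pseudo_ood, lam, perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 10, 4))      # 6 toy sequences, 10 frames, 4 features
    y = np.array([0, 0, 1, 1, 2, 2])     # toy sequence-level class ids
    mixed, flags, lam, perm = mixup_pseudo_ood(x, y, rng=rng)
    print(mixed.shape, int(flags.sum()), "pseudo-OOD samples")
```

In this sketch, blends of two sequences from the same class are left unflagged, since they plausibly remain in-distribution. Whether the actual framework mixes raw skeleton inputs, intermediate features, or frame-wise segments is not stated in the abstract.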
