We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mapping between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address that limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate superior performance by our approach against previous methods, including OPEL which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.
翻译:暂无翻译