We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and requires only a few iterations to solve the optimal transport problem. Our single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing only a single model and reduces both time and memory consumption compared to maintaining two separate single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves video alignment results comparable to, and action segmentation results superior to, previous methods in each respective task. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model. Our code is available on our research website: https://retrocausal.ai/research/.
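To make the alignment formulation concrete, the sketch below shows one generic way an entropy-regularized fused Gromov-Wasserstein coupling between two frame-embedding sequences could be computed on GPU with a temporal structural prior. This is an illustrative assumption, not the authors' exact solver: the helper names `structural_prior` and `fgw_align`, the near-diagonal Gaussian prior, and the Sinkhorn-style inner projection are all hypothetical choices following standard entropic FGW practice.

```python
# Minimal sketch (assumed, not the paper's exact algorithm) of an entropic
# fused Gromov-Wasserstein alignment between two video feature sequences,
# with a hypothetical structural prior that favors near-diagonal couplings.
import torch

def structural_prior(n, m, sigma=0.1, device="cpu"):
    # Quadratic penalty on deviation from the normalized temporal diagonal.
    i = torch.arange(n, device=device).float().unsqueeze(1) / max(n - 1, 1)
    j = torch.arange(m, device=device).float().unsqueeze(0) / max(m - 1, 1)
    return (i - j) ** 2 / (2 * sigma ** 2)

def fgw_align(x, y, alpha=0.5, lam=1.0, eps=0.05, iters=20):
    # x: (n, d) and y: (m, d) frame embeddings of the two videos.
    n, m = x.shape[0], y.shape[0]
    p = torch.full((n,), 1.0 / n, device=x.device)   # uniform frame masses
    q = torch.full((m,), 1.0 / m, device=x.device)
    M = torch.cdist(x, y) ** 2                        # cross-feature (Wasserstein) cost
    Cx, Cy = torch.cdist(x, x), torch.cdist(y, y)     # intra-sequence structure matrices
    R = structural_prior(n, m, device=x.device)       # temporal structural prior
    T = p.unsqueeze(1) * q.unsqueeze(0)               # initial coupling
    for _ in range(iters):
        # Linearized Gromov term for the squared loss (Peyre et al.-style decomposition).
        G = (Cx ** 2 @ p).unsqueeze(1) + (q @ Cy ** 2).unsqueeze(0) - 2 * Cx @ T @ Cy
        cost = (1 - alpha) * M + alpha * G + lam * R
        cost = cost - cost.min()                      # stabilize the exponential
        # A few Sinkhorn iterations on the linearized, entropy-regularized problem.
        K = torch.exp(-cost / eps)
        u = torch.ones_like(p)
        for _ in range(5):
            u = p / (K @ (q / (K.T @ u)))
        v = q / (K.T @ u)
        T = u.unsqueeze(1) * K * v.unsqueeze(0)       # updated soft alignment
    return T

# Usage: soft frame-to-frame alignment between a 40-frame and a 55-frame clip.
# T = fgw_align(torch.randn(40, 128), torch.randn(55, 128))
```

Because every step is a dense matrix operation, the whole loop runs in batched form on a GPU, which is consistent with the efficiency claim in the abstract; the specific prior weight `lam` and entropic temperature `eps` above are placeholder values.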