With the explosive growth of video data in complex scenarios, fast retrieval of group activities has become an urgent problem. However, most existing methods retrieve at the level of entire videos and cannot retrieve at activity granularity. To solve this problem, we propose, for the first time, a spatiotemporal interleaved video hashing (STVH) technique. Within a unified framework, STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution of both group visual features and positional features. Moreover, real-world video retrieval sometimes calls for activity features and at other times for the visual features of objects. To handle this harder task, we further propose M-STVH (multi-focused spatiotemporal video hashing), an enhanced version of STVH. M-STVH performs hierarchical feature integration through multi-focused representation learning, allowing the model to attend jointly to activity semantic features and object visual features. In comparative experiments on publicly available datasets, both STVH and M-STVH achieve excellent results.