SemCo: Toward Semantic Coherent Visual Relationship Forecasting

Visual Relationship Forecasting (VRF) aims to anticipate relations among objects without observing future visual content. The task relies on capturing and modeling the semantic coherence in object interactions, since this coherence underpins the evolution of events and scenes in videos. However, existing VRF datasets offer limited support for learning such coherence, owing to noisy annotations and weak correlations between different actions and the relationship transitions of subject-object pairs. Furthermore, existing methods struggle to distinguish similar relationships and overfit to relationships that remain unchanged across consecutive frames. To address these challenges, we present SemCoBench, a benchmark that emphasizes semantic coherence for visual relationship forecasting. Based on action labels and short-term subject-object pairs, SemCoBench decomposes relationship categories and dynamics by cleaning and reorganizing video datasets, to ensure that the semantic coherence in object interactions can be predicted. We also present the Semantic Coherent Transformer (SemCoFormer), which models semantic coherence with a Relationship Augmented Module (RAM) and a Coherence Reasoning Module (CRM). RAM is designed to distinguish similar relationships, while CRM directs the model's focus to the dynamics of relationships. Experimental results on SemCoBench demonstrate that modeling semantic coherence is a key step toward reasonable, fine-grained, and diverse visual relationship forecasting, contributing to a more comprehensive understanding of video scenes.
