Multi-Granularity Network with Modal Attention for Dense Affective Understanding

Video affective understanding, which aims to predict the expressions evoked by video content, is desirable for video creation and recommendation. The recent EEV challenge proposes a dense affective understanding task that requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features to better describe the target frame. Specifically, the multi-granularity features are divided into frame-level, clip-level, and video-level features, which correspond to visually salient content, semantic context, and video theme information, respectively. A modal attention fusion module is then designed to fuse the multi-granularity features and emphasize the more affect-relevant modalities. Finally, the fused feature is fed into a Mixture of Experts (MoE) classifier to predict the expressions. With additional model-ensemble post-processing, the proposed method achieves a correlation score of 0.02292 in the EEV challenge.
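
The pipeline described above (attention-weighted fusion of frame-, clip-, and video-level features, followed by an MoE classifier) can be illustrated with a minimal pure-Python sketch. The abstract does not specify how the attention weights are computed or how the MoE classifier is parameterized, so everything here is an assumption made for illustration: the dot-product scoring vector `attn_vector`, the softmax gating, the logistic experts, the random weights, and the toy sizes `DIM`, `NUM_EXPERTS`, and `NUM_CLASSES`. It is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def modal_attention_fuse(features, attn_vector):
    """Fuse same-sized feature vectors using softmax attention weights.

    features: list of vectors (frame-, clip-, video-level), each of length d.
    attn_vector: hypothetical learned scoring vector of length d (assumption).
    Returns the fused vector and the per-feature attention weights.
    """
    weights = softmax([dot(f, attn_vector) for f in features])
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


def moe_predict(x, gate_weights, expert_weights):
    """Mixture-of-experts score for a single expression class.

    gate_weights: one vector per expert, producing gating logits.
    expert_weights: one vector per expert, producing a logistic output.
    """
    gates = softmax([dot(x, g) for g in gate_weights])
    experts = [1.0 / (1.0 + math.exp(-dot(x, e))) for e in expert_weights]
    return sum(g * p for g, p in zip(gates, experts))


def rand_vec(n):
    """Random stand-in for a learned or extracted vector."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


random.seed(0)
DIM, NUM_EXPERTS, NUM_CLASSES = 8, 4, 3  # toy sizes, chosen only for illustration

# Random placeholders for the three granularities of features.
frame_feat, clip_feat, video_feat = rand_vec(DIM), rand_vec(DIM), rand_vec(DIM)
fused, modal_weights = modal_attention_fuse(
    [frame_feat, clip_feat, video_feat], rand_vec(DIM)
)

# One independent gate/expert set per expression class (assumption).
scores = [
    moe_predict(
        fused,
        [rand_vec(DIM) for _ in range(NUM_EXPERTS)],
        [rand_vec(DIM) for _ in range(NUM_EXPERTS)],
    )
    for _ in range(NUM_CLASSES)
]
```

In this sketch `modal_weights` sums to 1, so the fused vector is a convex combination of the three inputs, and each entry of `scores` lies strictly between 0 and 1 because it is a gate-weighted average of logistic outputs. A trained model would learn the scoring, gating, and expert parameters rather than sampling them.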
