MRTNet:视频判决多分辨率时间定位模拟网络 (MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding)

Given an untrimmed video and natural language query, video sentence grounding aims to localize the target temporal moment in the video. Existing methods mainly tackle this task by matching and aligning semantics of the descriptive sentence and video segments on a single temporal resolution, while neglecting the temporal consistency of video content in different resolutions. In this work, we propose a novel multi-resolution temporal video sentence grounding network: MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module. MRT module is an encoder-decoder network, and output features in the decoder part are in conjunction with Transformers to predict the final start and end timestamps. Particularly, our MRT module is hot-pluggable, which means it can be seamlessly incorporated into any anchor-free models. Besides, we utilize a hybrid loss to supervise cross-modal features in MRT module for more accurate grounding in three scales: frame-level, clip-level and sequence-level. Extensive experiments on three prevalent datasets have shown the effectiveness of MRTNet.

翻译：根据未剪接的视频和自然语言查询,视频句落地旨在将视频中的目标时间时刻定位到本地。现有方法主要通过将描述性句和视频段的语义与单个时间分辨率相匹配和对齐来完成这项任务,同时忽视不同分辨率视频内容的时间一致性。在这项工作中,我们提议建立一个新型的多分辨率时间视频句落地网络:MRTNet,由多式特征编码器、多分辨率时空模块和一个预测模块组成。MRT模块是一个编码器-decoder网络,而解码器部分的输出功能与变换器结合,以预测最终开始和结束时间戳。特别是,我们的MRT模块是热插的,这意味着它可以无缝地融入任何无锚模型。此外,我们利用混合损失来监督MRT模块的跨模式特征,以便在三个尺度上更精确地定位:框架级、剪接级和序列级。对三种流行数据集进行的广泛实验显示了MRTNet的有效性。

相关内容

Networking

关注 22

Networking：IFIP International Conferences on Networking。 Explanation：国际网络会议。 Publisher：IFIP。 SIT： http://dblp.uni-trier.de/db/conf/networking/index.html

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

【CVPR 2022】基于层次化视觉语言知识蒸馏的开放词汇单阶段检测，Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

专知会员服务

7+阅读 · 2022年3月19日

【CVPR 2022】使用多模态Transformer的端到端视频对象分割，End-to-End Referring Video Object Segmentation with Multimodal Transformer

专知会员服务

28+阅读 · 2022年3月3日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日