Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

This paper addresses the problem of describing the visual content of a video sequence in natural language. Unlike previous video captioning work, which mainly exploits cues from the video content to generate a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder component uses the forward flow to produce a sentence description from the encoded video semantic features. Two types of reconstructors are then proposed to exploit the backward flow: capitalizing on the hidden state sequence generated by the decoder, they reproduce the video features from local and global perspectives, respectively. Moreover, to reconstruct the video features comprehensively, we propose fusing the two types of reconstructors. The generation loss yielded by the encoder-decoder component and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Furthermore, RecNet is fine-tuned by CIDEr optimization via reinforcement learning, which significantly boosts captioning performance. Experimental results on benchmark datasets demonstrate that the proposed reconstructor consistently boosts video captioning performance.
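
To make the training objective described above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the reconstruction loss is a Euclidean distance between original and reconstructed features, that the local and global terms are fused by simple addition, that the two losses are combined with a trade-off weight `lam`, and that the reinforcement learning step uses a REINFORCE-style loss with a baseline reward. None of these details are stated in the abstract, and all function names and the default value of `lam` are hypothetical.

```python
import math


def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def generation_loss(word_probs):
    """Forward flow: negative log-likelihood of the reference words."""
    return -sum(math.log(p) for p in word_probs)


def global_reconstruction_loss(video_feats, recon_feats):
    """Global view (assumed form): compare mean-pooled feature vectors."""
    return euclidean(mean_pool(video_feats), mean_pool(recon_feats))


def local_reconstruction_loss(video_feats, recon_feats):
    """Local view (assumed form): average per-frame distance.

    Assumes one reconstructed vector per original frame.
    """
    dists = [euclidean(v, z) for v, z in zip(video_feats, recon_feats)]
    return sum(dists) / len(dists)


def joint_loss(word_probs, video_feats, recon_feats, lam=0.2):
    """Generation loss plus weighted reconstruction loss.

    lam is a hypothetical trade-off weight; summing the local and
    global terms is an assumed way of fusing the two reconstructors.
    """
    rec = (global_reconstruction_loss(video_feats, recon_feats)
           + local_reconstruction_loss(video_feats, recon_feats))
    return generation_loss(word_probs) + lam * rec


def policy_gradient_loss(sample_log_probs, sample_reward, baseline_reward):
    """REINFORCE-style loss with a baseline (assumed variant).

    The rewards stand in for CIDEr scores of a sampled caption and a
    baseline caption; computing CIDEr itself is out of scope here.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)


# Usage: two frames with 2-D features and a two-word caption.
video = [[0.0, 0.0], [2.0, 2.0]]
recon = [[1.0, 1.0], [1.0, 1.0]]
loss = joint_loss([0.5, 0.5], video, recon)
```

In this toy input the reconstructed frames have the same mean as the original frames, so the global term is zero while the local term is not, which illustrates why the two perspectives can carry complementary information.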
