Automatic video captioning aims at a holistic understanding of visual scenes. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations and to generate natural written language. While the majority of captioning models focus solely on visual input, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL divergence loss to train a video captioning model that is resilient to token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our hierarchical captioning module. Using our BMHRL, we show the suitability of the HRL agent for generating content-complete and grammatically sound sentences, achieving BLEU3, BLEU4, and METEOR scores of $4.91$, $2.23$, and $10.80$, respectively, on the ActivityNet Captions dataset. Finally, we make our BMHRL framework and trained models publicly available for users and developers at https://github.com/d-rothen/bmhrl.
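To illustrate the idea of a reward-guided KL divergence, the following is a minimal sketch, not the authors' implementation: it assumes a hypothetical function `reward_weighted_kl_loss` in which the per-token divergence between the model's predicted token distribution and a reference distribution is scaled by an external reward signal (e.g. a sentence-level metric broadcast to tokens) before averaging.

```python
# Hypothetical sketch of a reward-weighted KL divergence loss (PyTorch).
# Shapes and the reward source are assumptions for illustration only.
import torch
import torch.nn.functional as F


def reward_weighted_kl_loss(log_probs: torch.Tensor,
                            target_probs: torch.Tensor,
                            rewards: torch.Tensor) -> torch.Tensor:
    """
    log_probs:    (batch, seq_len, vocab) log-probabilities from the captioner.
    target_probs: (batch, seq_len, vocab) reference token distributions.
    rewards:      (batch, seq_len) per-token reward weights.
    """
    # Pointwise KL(target || prediction), summed over the vocabulary
    # to obtain one divergence value per token.
    kl_per_token = F.kl_div(log_probs, target_probs, reduction="none").sum(dim=-1)
    # Scale each token's divergence by its reward and average over the batch.
    return (rewards * kl_per_token).mean()
```

Under this sketch, tokens associated with higher rewards contribute more strongly to the gradient, which is one plausible way a reward signal could guide a divergence-based training objective.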