We introduce a unified framework to jointly model images, text, and human attention traces. Our work is built on top of the recent Localized Narratives annotation framework [30], where each word of a given caption is paired with a mouse trace segment. We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given only an image. Learning the grounding of each word is challenging, due to noise in the human-provided traces and the presence of words that cannot be meaningfully visually grounded. We present a novel model architecture that is jointly trained on dual tasks (controlled trace generation and controlled caption generation). To evaluate the quality of the generated traces, we propose a local bipartite matching (LBM) distance metric which allows the comparison of two traces of different lengths. Extensive experiments show our model is robust to imperfect training data and outperforms the baselines by a clear margin. Moreover, we demonstrate that our model pre-trained on the proposed tasks can also be beneficial to the downstream task of guided image captioning on COCO. Our code and project page are publicly available.
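To make the idea behind a local bipartite matching distance more concrete, below is a minimal sketch of one plausible way to compare two traces of different lengths: each trace is split into the same number of local segments, and segment centroids are matched under a temporal-locality constraint. This is an illustrative approximation under stated assumptions (traces as sequences of (x, y) points; the segment count, window size, and centroid representation are hypothetical choices), not the paper's exact LBM formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def lbm_distance_sketch(trace_a, trace_b, num_segments=10, window=2):
    """Rough stand-in for an LBM-style metric: split both traces into
    num_segments local chunks, then bipartite-match chunk centroids,
    only allowing matches within a local temporal window."""

    def segment_centroids(trace, k):
        # Split the (N, 2) point sequence into k roughly equal chunks
        # (assumes N >= k) and represent each chunk by its mean location.
        chunks = np.array_split(np.asarray(trace, dtype=float), k)
        return np.stack([c.mean(axis=0) for c in chunks])

    ca = segment_centroids(trace_a, num_segments)
    cb = segment_centroids(trace_b, num_segments)

    # Pairwise Euclidean costs between segment centroids.
    cost = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)

    # Forbid matches between temporally distant segments by adding a
    # large penalty outside the local window (the "local" in LBM).
    idx = np.arange(num_segments)
    too_far = np.abs(np.subtract.outer(idx, idx)) > window
    rows, cols = linear_sum_assignment(cost + 1e6 * too_far)
    return cost[rows, cols].mean()


# Usage: two synthetic traces of different lengths in [0, 1]^2.
rng = np.random.default_rng(0)
d = lbm_distance_sketch(rng.random((120, 2)), rng.random((85, 2)))
print(f"LBM-style distance: {d:.3f}")
```

The key design point this sketch tries to convey is that resampling both traces into a fixed number of local segments removes the length mismatch, while the windowed matching keeps the comparison temporally aligned rather than matching arbitrary parts of the two traces.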