Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work on acquiring this kind of knowledge in the joint space of vision and language, it has mostly focused on aligning image patches with tokens on the language side. However, image patches carry no intrinsic meaning to the human eye, and individual tokens do not necessarily correspond to groundable information in the image. Rather, it is groups of tokens that describe different aspects of the scene. In this work, we propose a model that groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect these representations to operate at the level of objects present in the image, and therefore align them with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model achieves a better fine-grained understanding of vision and language. In addition, the token groups our model discovers are highly similar to groundable phrases in the text, both qualitatively and quantitatively.
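To make the grouping idea concrete, the sketch below pools caption-token embeddings into a small number of group vectors using learned group queries and softmax attention. This is only an illustrative implementation under our own assumptions (the function names, the number of groups, and the attention-pooling form are hypothetical); the paper's actual grouping module may differ.

```python
import numpy as np

def group_tokens(tokens, group_queries):
    """Pool token embeddings into group vectors via softmax attention.

    Illustrative sketch, not the paper's exact module.
    tokens:        (n_tokens, dim) caption-token embeddings
    group_queries: (n_groups, dim) learned group query vectors
    returns:       (n_groups, dim) one vector per token group
    """
    scores = group_queries @ tokens.T                # (n_groups, n_tokens)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)          # each group attends over all tokens
    return attn @ tokens                             # attention-weighted token pooling

# Toy usage: 12 token embeddings of dimension 8, pooled into 4 groups.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))
queries = rng.normal(size=(4, 8))
groups = group_tokens(tokens, queries)
print(groups.shape)  # (4, 8)
```

Each resulting group vector could then be aligned (e.g. contrastively) with object-level features produced by the image encoder, matching the object-level granularity the abstract describes.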