Instance-Level Composed Image Retrieval

Composed image retrieval (CIR), a popular research direction in image retrieval that uses a combined visual and textual query, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while a semi-automated selection of hard negatives keeps it challenging (comparable to retrieval among more than 40M random distractors). To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities and performs late fusion to up-weight images that satisfy both queries while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of simple and intuitive components. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
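The late-fusion idea in the abstract can be illustrated with a minimal sketch. This is not the BASIC implementation: the abstract does not specify the fusion rule or the components that refine each similarity, so the min-max normalization and the multiplicative fusion below are assumptions chosen only because a product is high when both similarities are high and low when either one is low. The function names (`cosine`, `min_max`, `late_fusion_rank`) and the toy 2-D embeddings are hypothetical; in practice the embeddings would come from a pre-trained VLM that maps images and text into a shared space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so the two similarity lists are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion_rank(query_image_emb, query_text_emb, db_embs):
    """Rank database items by the product of the two normalized similarities.

    Assumption: multiplicative fusion stands in for the paper's fusion step.
    """
    # Estimate the two similarities separately, as the abstract describes.
    img_sims = min_max([cosine(query_image_emb, d) for d in db_embs])
    txt_sims = min_max([cosine(query_text_emb, d) for d in db_embs])
    # The product is large only when an item matches both queries.
    fused = [i * t for i, t in zip(img_sims, txt_sims)]
    ranking = sorted(range(len(db_embs)), key=lambda k: fused[k], reverse=True)
    return ranking, fused


# Toy data (hypothetical): item 0 matches only the image query,
# item 1 matches only the text query, item 2 partially matches both.
db = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking, fused = late_fusion_rank([1.0, 0.0], [0.0, 1.0], db)
```

In this toy setup the item that matches both queries is ranked first, while the items that match only one query receive a fused score of zero. A real system would also need the per-similarity refinements the abstract mentions, which are not reproduced here.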

