The progress of composed image retrieval (CIR), a popular research direction in image retrieval in which a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, adopts an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while a semi-automated selection of hard negatives keeps it challenging, comparable to retrieval among more than 40M random distractors. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities and performs late fusion to upweight images that satisfy both queries while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of simple and intuitive components. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
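The late-fusion idea described above can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the multiplicative combination of min-max-normalized similarities below is an assumption for illustration only; the actual BASIC method may differ.

```python
import numpy as np

def late_fusion(sim_img: np.ndarray, sim_txt: np.ndarray) -> np.ndarray:
    """Hypothetical late fusion of query-image-to-image and
    query-text-to-image similarities over a gallery of images.

    Multiplying the two (normalized) scores upweights images that
    score high on BOTH queries and suppresses images that score
    high on only one of the two.
    """
    def norm(s: np.ndarray) -> np.ndarray:
        # Min-max normalize to [0, 1] so the two modalities are comparable.
        return (s - s.min()) / (s.max() - s.min() + 1e-8)

    return norm(sim_img) * norm(sim_txt)

# Toy gallery of three images:
#   image 0: similar to both the visual and the textual query
#   image 1: similar only to the visual query
#   image 2: similar only to the textual query
sim_img = np.array([0.9, 0.9, 0.1])
sim_txt = np.array([0.8, 0.1, 0.9])
fused = late_fusion(sim_img, sim_txt)
best = int(np.argmax(fused))  # image 0 ranks first under this fusion rule
```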