Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and follow long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. We propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that embeds both images and text descriptions in a shared semantic space grounded in structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that combining visual, textual, and structured knowledge substantially improves accuracy, especially for rare and unseen entities. Our smallest model improves accuracy on unseen entities by 10.5% over the state-of-the-art, despite being 35 times smaller.
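To make the contrastive alignment concrete, the sketch below shows a minimal symmetric InfoNCE-style objective that pulls an image embedding toward the embedding of its matching entity description in the shared space. This is an illustrative assumption of a CLIP-style formulation, not the exact KnowCoL objective: the function name `contrastive_alignment_loss` and the temperature value are hypothetical, and the additional grounding on Wikidata type hierarchies and relations described above is not shown.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss aligning image embeddings with
    entity-description embeddings in a shared semantic space.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    Hypothetical sketch -- the paper's full objective also uses structured
    knowledge (types, relations) from Wikidata, which is omitted here.
    """
    # Project onto the unit sphere so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Image-to-description and description-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```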