Explaining Visual Biases as Words by Generating Captions

We aim to diagnose potential biases in image classifiers. Prior works either manually labeled biased attributes, which incurs high annotation costs, or visualized biased features, which are often ambiguous to interpret. Instead, we leverage two types of pre-trained vision-language models, generative and discriminative, to describe a visual bias as a word. Specifically, we propose bias-to-text (B2T), which uses a pre-trained captioning model to generate captions of mispredicted images and extracts the common keywords that may describe visual biases. We then categorize the bias type as spurious correlation or majority bias by checking whether the keyword is specific or agnostic to the class, based on the similarity between class-wise mispredicted images and the keyword in a pre-trained vision-language joint embedding space, e.g., CLIP. We demonstrate that this simple and intuitive scheme recovers well-known gender and background biases and discovers novel ones in real-world datasets. Moreover, we use B2T to compare classifiers with different architectures or training methods. Finally, we show that one can obtain debiased classifiers using the B2T bias keywords and CLIP, in both zero-shot and full-shot manners, without any human annotation of the bias.
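To make the two stages concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the captions and the joint-embedding vectors have already been produced by a captioning model and a CLIP-like encoder; here they are plain strings and lists of floats. The function names (`extract_keywords`, `keyword_score`, `categorize`), the stopword list, the contrast against correctly predicted images, and the `threshold` parameter are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the sketch.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "is", "at", "to"}


def extract_keywords(captions, top_k=5):
    """Return the top_k words that appear in the most captions."""
    counts = Counter()
    for caption in captions:
        # Count each word at most once per caption.
        tokens = set(re.findall(r"[a-z]+", caption.lower()))
        counts.update(tokens - STOPWORDS)
    # Sort by descending count, then alphabetically, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def keyword_score(keyword_emb, wrong_embs, correct_embs):
    """Mean similarity to mispredicted images minus mean similarity to
    correctly predicted images (the contrast is an assumption of this sketch)."""
    sim_wrong = sum(cosine(keyword_emb, e) for e in wrong_embs) / len(wrong_embs)
    sim_correct = sum(cosine(keyword_emb, e) for e in correct_embs) / len(correct_embs)
    return sim_wrong - sim_correct


def categorize(scores_by_class, threshold):
    """Label a keyword from its per-class scores.

    Class-specific (high in some classes only) -> spurious correlation;
    class-agnostic (high in every class) -> majority bias.
    """
    high = [c for c, s in scores_by_class.items() if s > threshold]
    if not high:
        return "no bias detected"
    if len(high) == len(scores_by_class):
        return "majority bias"
    return "spurious correlation"
```

In practice, `captions` would be the captions generated for the mispredicted images of a classifier, and the embeddings would come from the image and text encoders of the joint embedding model. A keyword whose score is high for only one class would be flagged as a candidate spurious correlation under this sketch.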
