Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets, but they are highly vulnerable to backdoor attacks, which raises serious safety concerns. In this paper, we reveal that CLIP's vulnerability primarily stems from its tendency to encode features beyond in-dataset predictive patterns, which compromises its visual feature resistivity to input perturbations and makes its encoded features highly susceptible to being reshaped by backdoor triggers. To address this challenge, we propose Repulsive Visual Prompt Tuning (RVPT), a novel defense approach that employs deep visual prompt tuning with a specially designed feature-repelling loss. Specifically, RVPT adversarially repels the encoded features from deeper layers while optimizing the standard cross-entropy loss, ensuring that only features predictive for the downstream task are encoded. This enhances CLIP's visual feature resistivity against input perturbations and mitigates its susceptibility to backdoor attacks. Unlike existing multimodal backdoor defense methods, which typically require access to poisoned data or fine-tune the entire model, RVPT needs only few-shot clean downstream samples and tunes a small number of parameters. Empirical results demonstrate that RVPT tunes only 0.27\% of the parameters in CLIP, yet significantly outperforms state-of-the-art defense methods, reducing the attack success rate from 89.70\% to 2.76\% against the most advanced multimodal attacks on ImageNet, and that its defensive capabilities generalize effectively across multiple datasets.
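To make the training objective concrete, the following is a minimal sketch of a combined loss of the kind described above: a standard cross-entropy term plus a feature-repelling term. The abstract does not give the exact form of the repelling loss, so this sketch assumes it is the mean cosine similarity between the features of the prompt-tuned model and those of the frozen model at a set of deeper layers. The function names, the weight `alpha`, and the choice of layers are all hypothetical and serve only as illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def cross_entropy(logits, label):
    """Standard cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def feature_repelling_loss(prompted_feats, frozen_feats):
    """ASSUMED form: mean cosine similarity over the selected deeper layers.

    Minimizing this pushes the prompt-tuned features away from the
    features that the frozen model encodes at the same layers.
    """
    sims = [cosine_similarity(p, f) for p, f in zip(prompted_feats, frozen_feats)]
    return sum(sims) / len(sims)


def rvpt_style_loss(logits, label, prompted_feats, frozen_feats, alpha=1.0):
    """Cross-entropy plus the weighted repelling term (alpha is hypothetical)."""
    return cross_entropy(logits, label) + alpha * feature_repelling_loss(
        prompted_feats, frozen_feats
    )


# Toy usage: two "deeper layers", each with a 2-dimensional feature.
frozen = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]    # same directions as the frozen features
repelled = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the frozen features
logits = [2.0, 0.5, -1.0]

print(rvpt_style_loss(logits, 0, aligned, frozen))
print(rvpt_style_loss(logits, 0, repelled, frozen))
```

With identical logits, the second call yields the lower loss, because its features are orthogonal to the frozen ones. In an actual setup only the visual prompt parameters would receive gradients from this objective, while the CLIP weights stay frozen.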