Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior studies have primarily examined unimodal deep networks, the vulnerability of vision-language models (VLMs) remains largely unexplored. In this work, we present the first systematic study of MI attacks on VLMs to understand their susceptibility to leaking private visual training data. Our work makes two main contributions. First, tailored to the token-generative nature of VLMs, we introduce a suite of token-based and sequence-based model inversion strategies, providing a comprehensive analysis of VLMs' vulnerability under different attack formulations. Second, based on the observation that tokens vary in their visual grounding, and hence their gradients differ in informativeness for image reconstruction, we propose Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW), a novel MI attack for VLMs. SMI-AW dynamically reweights each token's loss gradient according to its visual grounding, enabling the optimization to focus on visually informative tokens and more effectively guide the reconstruction of private images. Through extensive experiments and human evaluations on a range of state-of-the-art VLMs across multiple datasets, we show that VLMs are susceptible to training data leakage. Human evaluation of the reconstructed images yields an attack accuracy of 61.21%, underscoring the severity of these privacy risks. Notably, we demonstrate that publicly released VLMs are vulnerable to such attacks. Our study highlights the urgent need for privacy safeguards as VLMs become increasingly deployed in sensitive domains such as healthcare and finance. Additional experiments are provided in the supplementary material.
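To make the adaptive token weighting idea concrete, the sketch below shows one plausible reading of it: per-token caption losses are reweighted by a visual-grounding score before the gradient is backpropagated into the image being reconstructed. This is a minimal illustrative sketch, not the paper's implementation; the toy VLM, the softmax normalization with a temperature, and the use of precomputed grounding scores (e.g., attention mass on image tokens) are all assumptions made for the example.

```python
# Sketch of sequence-based inversion with adaptive token weighting (SMI-AW idea).
# Assumptions: grounding scores are given per token; the "VLM" here is a toy
# stand-in so the example is self-contained and runnable.
import torch
import torch.nn.functional as F


def adaptive_weighted_loss(per_token_loss: torch.Tensor,
                           grounding_scores: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Reweight per-token losses by visual grounding and sum.

    per_token_loss:   (T,) cross-entropy of each target caption token.
    grounding_scores: (T,) non-negative scores of how visually grounded each
                      token is (assumed proxy: attention mass on image tokens).
    """
    weights = torch.softmax(grounding_scores / temperature, dim=0)  # normalize to sum to 1
    return (weights * per_token_loss).sum()


# --- toy demonstration of the reconstruction loop ---------------------------
torch.manual_seed(0)
vocab_size, seq_len = 32, 6
# Stand-in "VLM": maps an image to per-token logits (illustrative only).
toy_vlm = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 8 * 8, seq_len * vocab_size))

target_tokens = torch.randint(0, vocab_size, (seq_len,))  # private caption tokens
grounding = torch.rand(seq_len)                           # assumed grounding scores
x = torch.zeros(1, 3, 8, 8, requires_grad=True)           # image being reconstructed
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    logits = toy_vlm(x).view(seq_len, vocab_size)
    per_token_loss = F.cross_entropy(logits, target_tokens, reduction="none")
    loss = adaptive_weighted_loss(per_token_loss, grounding)
    opt.zero_grad()
    loss.backward()   # weighted gradient flows into the candidate image
    opt.step()
```

Under this reading, tokens with little visual grounding (e.g., function words) receive small weights, so their noisy gradients contribute less to the image update than visually descriptive tokens.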