Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buries them within complex architectures, we systematically evaluate them across modalities (audio, video, and combined audio-visual) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of the encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information and that this information is complementary across modalities. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics rather than from the features themselves latching onto superficial patterns. These results expose both the promise and the fundamental challenges of self-supervised representations for deepfake detection: although they learn meaningful patterns, achieving robust cross-domain performance remains elusive.
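To make the evaluation setup concrete, the sketch below shows the general recipe such studies follow: freeze a self-supervised encoder, pool its frame-level features into a clip embedding, and fit a lightweight linear probe for real/fake classification. This is an illustrative sketch, not this paper's exact pipeline; the facebook/wav2vec2-base checkpoint, mean-pooling, the logistic-regression probe, and the random stand-in clips are all assumptions chosen for brevity.

```python
# Minimal sketch: linear-probing a frozen self-supervised audio encoder
# for real/fake classification. Assumptions (not from the paper): the
# wav2vec2-base checkpoint, mean-pooling, a logistic-regression probe,
# and random stand-in waveforms in place of a real deepfake dataset.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-base"  # any frozen SSL encoder would do
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
encoder = Wav2Vec2Model.from_pretrained(MODEL).eval()  # weights stay frozen

@torch.no_grad()
def embed(waveforms, sr=16000):
    """Mean-pool the encoder's frame-level features into one vector per clip."""
    inputs = extractor(waveforms, sampling_rate=sr,
                       return_tensors="pt", padding=True)
    hidden = encoder(**inputs).last_hidden_state  # (batch, frames, dim)
    return hidden.mean(dim=1).numpy()             # (batch, dim)

# Stand-in data: random 1-second clips with random real/fake labels,
# used only so the sketch runs end to end.
rng = np.random.default_rng(0)
def stand_in_clips(n):
    waves = [rng.standard_normal(16000).astype(np.float32) for _ in range(n)]
    return waves, rng.integers(0, 2, size=n)

train_x, train_y = stand_in_clips(32)
test_x, test_y = stand_in_clips(8)

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_x), train_y)
print("probe accuracy:", probe.score(embed(test_x), test_y))
```

Because the encoder is never fine-tuned, the probe's accuracy directly measures how much deepfake-relevant information the frozen representation already carries; swapping in a video or lip-reading encoder follows the same pattern.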