Deepfake speech detection presents a growing challenge as generative audio technologies continue to advance. We propose a hybrid training framework that improves detection performance through novel augmentation strategies. First, we introduce a dual-stage masking approach that operates both at the spectrogram level (MaskedSpec) and within the latent feature space (MaskedFeature), providing complementary regularization that improves tolerance to localized distortions and enhances generalization. Second, we introduce a compression-aware augmentation strategy during self-supervised pretraining that increases data variability in low-resource scenarios while preserving the integrity of learned representations, thereby improving the suitability of pretrained features for deepfake detection. The framework integrates a learnable self-supervised feature extractor with a ResNet classification head in a unified training pipeline, enabling joint adaptation of acoustic representations and discriminative patterns. On the ASVspoof5 Challenge (Track 1), the system achieves state-of-the-art results with an Equal Error Rate (EER) of 4.08% under closed conditions, further reduced to 2.71% through fusion of models with diverse pretrained feature extractors. When trained on ASVspoof2019, our system obtains leading performance on the ASVspoof2019 evaluation set (0.18% EER) and the ASVspoof2021 DF task (2.92% EER).
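To make the dual-stage masking concrete, the minimal PyTorch-style sketch below masks random time/frequency bands of the input spectrogram (MaskedSpec-style) and random frames of the latent self-supervised features (MaskedFeature-style). The function names, mask counts, and widths are illustrative assumptions; the abstract does not specify the exact configurations used in the paper.

```python
import torch

def mask_spectrogram(spec, num_masks=2, max_time=20, max_freq=10):
    """Spectrogram-level masking (MaskedSpec-style sketch): zero out random
    time and frequency bands of a (batch, freq, time) spectrogram.
    Mask counts and widths are illustrative, not the paper's settings."""
    spec = spec.clone()
    _, n_freq, n_time = spec.shape
    for _ in range(num_masks):
        t_width = torch.randint(1, max_time + 1, (1,)).item()
        t_start = torch.randint(0, max(1, n_time - t_width), (1,)).item()
        spec[:, :, t_start:t_start + t_width] = 0.0
        f_width = torch.randint(1, max_freq + 1, (1,)).item()
        f_start = torch.randint(0, max(1, n_freq - f_width), (1,)).item()
        spec[:, f_start:f_start + f_width, :] = 0.0
    return spec

def mask_features(feats, mask_prob=0.1):
    """Latent-space masking (MaskedFeature-style sketch): drop random frames
    of a (batch, time, dim) self-supervised feature sequence before the
    ResNet classification head."""
    keep = (torch.rand(feats.shape[:2], device=feats.device) > mask_prob)
    return feats * keep.unsqueeze(-1)
```

Applying the two masks at different stages of the pipeline is what provides the complementary regularization described above: one perturbs the raw acoustic input, the other perturbs the learned representation fed to the classifier.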