Backdoor attacks pose a significant threat to Large Language Models (LLMs): adversaries can embed hidden triggers that manipulate a model's outputs. Most existing defenses, designed primarily for classification tasks, are ill-suited to the autoregressive generation and vast output space of LLMs, and consequently suffer from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in the output space. We identify a critical phenomenon, which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a near-100\% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.
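To make the sliding-window mechanism concrete, the following is a minimal sketch of how sequence-lock detection over streamed token confidences could look. The window size, thresholds, and the exact decision rule (high mean confidence combined with low variance) are illustrative assumptions, not the paper's settings.

```python
from collections import deque

def detect_sequence_lock(token_confidences, window_size=8,
                         mean_threshold=0.95, var_threshold=1e-3):
    """Flag a generation as backdoored if any sliding window of
    per-token confidences is abnormally high AND abnormally flat.

    token_confidences: iterable of per-token probabilities (e.g. the
    max softmax probability of each generated token), streamed during
    decoding. All thresholds here are placeholders for illustration.
    """
    window = deque(maxlen=window_size)
    for conf in token_confidences:
        window.append(conf)
        if len(window) == window_size:
            mean = sum(window) / window_size
            var = sum((c - mean) ** 2 for c in window) / window_size
            # Sequence lock: abnormally high and consistent confidence.
            if mean >= mean_threshold and var <= var_threshold:
                return True  # abort generation or raise an alert
    return False
```

Because the check consumes confidences as they are produced during decoding, it adds only a constant amount of per-token bookkeeping, which is consistent with the near-zero latency overhead claimed above.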