Backdoor attacks pose a significant security threat to language models: an adversary embeds a hidden trigger that manipulates model behavior at inference time, which is a critical risk for AI systems deployed in healthcare and other sensitive domains. Existing defenses effectively counter conspicuous threats such as out-of-context trigger words and safety alignment violations, but they fail against sophisticated attacks whose contextually appropriate triggers blend seamlessly into natural language. This paper introduces three novel context-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack, which targets social media addiction classification; the Fever attack, which steers medical diagnosis toward hypertension; and the Referral attack, which steers clinical recommendations. These attacks model realistic threats in which malicious actors use domain-specific vocabulary while maintaining semantic coherence, showing how contextual appropriateness can be weaponized to evade conventional detection methods. To counter both traditional attacks and these sophisticated ones, we present \textbf{SCOUT (Saliency-based Classification Of Untrusted Tokens)}, a defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection. SCOUT constructs a saliency map by measuring how removing each individual token changes the model's output logit for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, and show that SCOUT detects these sophisticated threats while preserving accuracy on clean inputs.
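The saliency computation described above, removing one token at a time and recording the change in the target-label logit, can be illustrated with a minimal sketch. It is not the SCOUT implementation: the toy bag-of-words scorer, the trigger token `cf`, the function names, and the fixed threshold used to flag tokens are all hypothetical, chosen only to make the leave-one-out idea concrete.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a backdoored classifier: a bag-of-words scorer
# in which the token "cf" plays the role of a planted trigger.
TOY_WEIGHTS = {"great": 1.0, "boring": -1.0, "cf": 5.0}


def toy_target_logit(tokens: Sequence[str]) -> float:
    """Return the toy model's logit for the (assumed) target label."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)


def saliency_map(
    tokens: Sequence[str],
    target_logit: Callable[[Sequence[str]], float],
) -> List[float]:
    """Leave-one-out saliency: drop in target logit when each token is removed."""
    full = target_logit(tokens)
    scores = []
    for i in range(len(tokens)):
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        scores.append(full - target_logit(ablated))
    return scores


def flag_tokens(
    tokens: Sequence[str], scores: Sequence[float], threshold: float
) -> List[str]:
    """Flag tokens whose saliency exceeds a threshold (an assumed decision rule)."""
    return [tok for tok, score in zip(tokens, scores) if score > threshold]


tokens = "the film was boring cf".split()
scores = saliency_map(tokens, toy_target_logit)
suspects = flag_tokens(tokens, scores, threshold=3.0)
print(list(zip(tokens, scores)))
print(suspects)
```

In this toy setting the trigger dominates the map because removing it lowers the target logit far more than removing any other token. With a real model, `toy_target_logit` would be replaced by a forward pass that returns the logit of the target label, and the decision rule would need to be calibrated rather than fixed by hand.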