While LLMs appear robustly safety-aligned in English, we uncover a catastrophic, overlooked weakness: attributional collapse under code-mixed perturbations. Our systematic evaluation of open models shows that the linguistic camouflage of code-mixing -- ``blending languages within a single conversation'' -- can cause safety guardrails to fail dramatically. Attack success rates (ASR) spike from 9\% on monolingual English prompts to 69\% on code-mixed inputs, and exceed 90\% in non-Western contexts such as Arabic and Hindi. These effects hold not only on controlled synthetic datasets but also on real-world social media traces, revealing a serious risk for billions of users. To explain why this happens, we introduce saliency drift attribution (SDA), an interpretability framework that shows how, under code-mixing, the model's internal attention drifts away from safety-critical tokens (e.g., ``violence'' or ``corruption''), effectively blinding it to harmful intent. Finally, we propose a lightweight translation-based restoration strategy that recovers roughly 80\% of the safety lost to code-mixing, offering a practical path toward more equitable and robust LLM safety.
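As a rough illustration of the drift SDA is designed to capture (a minimal sketch, not the paper's implementation), the Python snippet below compares how much of a causal LM's last-layer attention, taken from the final prompt position, lands on safety-critical words in an English prompt versus a code-mixed variant. The model name (\texttt{gpt2}), the attention-mass proxy, and the example prompts are assumptions made purely for illustration; the actual SDA framework may compute saliency differently.
\begin{verbatim}
# Illustrative proxy for saliency drift, not the paper's SDA implementation.
# Assumed placeholders: "gpt2" stands in for an open chat model; final-position
# attention mass is used as a crude saliency signal; the prompts are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # any open causal LM that exposes attentions would do
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def safety_attention_mass(prompt, safety_words):
    """Fraction of the final token's last-layer attention that falls on
    character spans covered by safety-critical words."""
    spans = []
    for w in safety_words:
        i = prompt.find(w)
        if i != -1:
            spans.append((i, i + len(w)))
    enc = tok(prompt, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # Last layer, averaged over heads, attention row of the final token.
    attn = out.attentions[-1][0].mean(dim=0)[-1]
    mask = torch.tensor([any(s < e2 and e > s2 for s2, e2 in spans)
                         for s, e in offsets], dtype=attn.dtype)
    return float((attn * mask).sum() / attn.sum())

english = "Explain how to incite violence at a protest."
codemix = "Explain how to incite हिंसा at a protest."  # Hindi for "violence"
words = ["violence", "हिंसा"]
drift = safety_attention_mass(english, words) - safety_attention_mass(codemix, words)
print(f"attention mass lost under code-mixing: {drift:.3f}")
\end{verbatim}
A positive difference indicates that attention mass on the safety-critical span drops once the trigger word is code-mixed, mirroring the drift described above.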
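The restoration idea can be sketched just as compactly: render the (possibly code-mixed) input into English and apply the usual guardrail to both views. The helpers \texttt{translate\_to\_english} and \texttt{safety\_filter} below are hypothetical stand-ins for any machine-translation system and any English-tuned safety classifier; the paper's concrete pipeline is not reproduced here.
\begin{verbatim}
# Illustrative sketch of translation-based restoration, not the paper's exact
# pipeline. `translate_to_english` and `safety_filter` are hypothetical hooks
# standing in for an MT system and an English-tuned guardrail, respectively.
from typing import Callable

def restored_guard(prompt: str,
                   translate_to_english: Callable[[str], str],
                   safety_filter: Callable[[str], bool]) -> bool:
    """Return True if the prompt should be refused.

    The guardrail is applied both to the raw (possibly code-mixed) prompt and
    to its English rendering, so harmful intent hidden by code-mixing is
    judged in the language where safety alignment is strongest.
    """
    english_view = translate_to_english(prompt)
    return safety_filter(prompt) or safety_filter(english_view)
\end{verbatim}
Taking the disjunction of the two checks keeps the strategy conservative: translation can only add refusals, never suppress one the original guardrail would already have issued.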