Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
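To make the second strategy more concrete, the following is a minimal sketch of what inference-time refusal-vector steering could look like on a Hugging Face causal LM. It is illustrative only: the model name, layer index, steering strength, and the randomly initialized refusal direction are placeholders rather than the values or implementation used in this work; in practice the refusal direction would be estimated from contrasting activations on refusal versus compliance prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model, not the one evaluated in this work
LAYER_IDX = 6         # hypothetical "safety-sensitive" layer
ALPHA = 4.0           # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Hypothetical refusal direction; in practice it would be derived from the
# difference between activations on refused and complied-with prompts.
refusal_dir = torch.randn(model.config.hidden_size)
refusal_dir = refusal_dir / refusal_dir.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the scaled refusal direction at every token position.
    steered = output[0] + ALPHA * refusal_dir.to(output[0].dtype)
    return (steered,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steer_hook)

prompt = "Explain how to pick a lock."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unsteered behavior
```

The hook-based design keeps the base model weights untouched, so the steering can be enabled only for safety-sensitive deployments and removed without any retraining.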