Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
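To make the second strategy more concrete, the following is a minimal sketch of what inference-time refusal-vector steering could look like on a Hugging Face causal LM. It is illustrative only: the model name, layer index, steering strength, and the randomly initialized refusal direction are placeholders rather than the values or implementation used in this work; in practice the refusal direction would be estimated from contrasting activations on refusal versus compliance prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model, not the one evaluated in this work
LAYER_IDX = 6         # hypothetical "safety-sensitive" layer
ALPHA = 4.0           # hypothetical steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Hypothetical refusal direction; in practice it would be derived from the
# difference between activations on refused and complied-with prompts.
refusal_dir = torch.randn(model.config.hidden_size)
refusal_dir = refusal_dir / refusal_dir.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the scaled refusal direction at every token position.
    steered = output[0] + ALPHA * refusal_dir.to(output[0].dtype)
    return (steered,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steer_hook)

prompt = "Explain how to pick a lock."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unsteered behavior
```

The hook-based design keeps the base model weights untouched, so the steering can be enabled only for safety-sensitive deployments and removed without any retraining.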