Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks use carefully crafted prompts to bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing jailbreak detection methods typically fine-tune LLMs into static safety models on fixed training datasets. However, these methods incur substantial computational costs whenever model parameters must be updated to improve robustness, especially against novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip the guard with memory: upon encountering a novel jailbreak attack, the system memorizes the attack pattern, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for fast preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide a secondary filter on the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.
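To make the preliminary detection step concrete, the following is a minimal sketch of how a memory bank of historical activations might be queried; the abstract does not specify the similarity metric or decision rule, so the class name, the cosine-similarity comparison, and the `threshold` parameter are illustrative assumptions, not the paper's specification.

```python
import numpy as np

class ActivationMemoryBank:
    """Hypothetical memory bank of activation vectors from confirmed
    jailbreak prompts, used for fast preliminary detection.

    Assumptions: cosine similarity as the matching metric and a fixed
    similarity threshold; neither is specified in the abstract.
    """

    def __init__(self, threshold: float = 0.85):
        self.vectors: list[np.ndarray] = []  # memorized attack activations
        self.threshold = threshold           # assumed similarity cutoff

    def memorize(self, activation: np.ndarray) -> None:
        """Store the (L2-normalized) activation of a confirmed attack."""
        self.vectors.append(activation / np.linalg.norm(activation))

    def preliminary_detect(self, activation: np.ndarray) -> bool:
        """Flag the prompt if it resembles any memorized attack pattern."""
        if not self.vectors:
            return False
        query = activation / np.linalg.norm(activation)
        sims = np.stack(self.vectors) @ query  # cosine similarities
        return float(sims.max()) >= self.threshold
```

In the MAAG pipeline described above, prompts flagged by this lookup would then be handed to the defense agent for response simulation, with the auxiliary agent supervising that simulation as a secondary filter.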