Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields such as healthcare, software engineering, and conversational systems. Despite these advances in recent years, LLMs remain considerably vulnerable, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We broadly categorize attack approaches as prompt-based, model-based, multimodal, or multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We then review defense mechanisms, including prompt filtering, prompt transformation, alignment techniques, multi-agent defenses, and self-regulation, and evaluate their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges such as quantifying attack success in interactive contexts and biases in existing datasets. After identifying current research gaps, we suggest future directions, including resilient alignment strategies, advanced defenses against evolving attacks, automated jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to strengthen LLM security and ensure safe deployment.