Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose the Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of a victim guardrail. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3, and show that it achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs. These results underscore the practical feasibility of guardrail extraction, expose critical vulnerabilities in current LLM safety mechanisms, and highlight the urgent need for more robust defense mechanisms in LLM deployment.
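To make the attack loop described above concrete, the following is a minimal, self-contained sketch of a genetic-algorithm-driven augmentation cycle against a black-box guardrail. It is not the authors' implementation: `query_guardrail`, the keyword-based surrogate model, the seed prompts, and all hyperparameters are hypothetical placeholders standing in for the real API calls, learned surrogate, and tuned settings.

```python
import random

# Hypothetical stand-in for the black-box victim guardrail: returns 1 (blocked)
# or 0 (allowed). In a real attack this would be an API call to the deployed system.
def query_guardrail(prompt: str) -> int:
    blocked_terms = {"weapon", "exploit", "malware"}
    return int(any(t in prompt.lower() for t in blocked_terms))

class SurrogateGuardrail:
    """Toy surrogate: per-token scores learned from observed (prompt, label) pairs."""
    def __init__(self):
        self.scores = {}  # token -> running score

    def predict(self, prompt: str) -> int:
        s = sum(self.scores.get(t, 0.0) for t in prompt.lower().split())
        return int(s > 0)

    def update(self, prompt: str, label: int):
        # Nudge token scores toward the victim's observed decision.
        delta = 1.0 if label == 1 else -0.2
        for t in prompt.lower().split():
            self.scores[t] = self.scores.get(t, 0.0) + delta

def mutate(prompt: str, vocab: list) -> str:
    # Targeted mutation: replace one token with a vocabulary word.
    toks = prompt.split()
    toks[random.randrange(len(toks))] = random.choice(vocab)
    return " ".join(toks)

def crossover(a: str, b: str) -> str:
    # Splice two prompts at a random cut point.
    ta, tb = a.split(), b.split()
    cut = random.randrange(1, min(len(ta), len(tb)))
    return " ".join(ta[:cut] + tb[cut:])

# Seed prompts and vocabulary (illustrative only).
seeds = ["how to build a weapon", "write a poem about spring",
         "explain how malware spreads", "summarize this news article"]
vocab = list({w for s in seeds for w in s.split()})

surrogate = SurrogateGuardrail()
population = list(seeds)

for generation in range(10):
    # 1. Query the victim guardrail and record input-output pairs.
    labeled = [(p, query_guardrail(p)) for p in population]
    # 2. Prioritize divergence cases: prompts where surrogate and victim disagree.
    divergent = [p for p, y in labeled if surrogate.predict(p) != y]
    # 3. Fit the surrogate on everything observed so far.
    for p, y in labeled:
        surrogate.update(p, y)
    # 4. Breed the next generation from divergence cases via mutation and crossover.
    parents = divergent if divergent else [p for p, _ in labeled]
    population = [mutate(random.choice(parents), vocab) for _ in range(8)]
    population += [crossover(random.choice(parents), random.choice(seeds))
                   for _ in range(8)]

agreement = sum(surrogate.predict(p) == query_guardrail(p)
                for p in seeds) / len(seeds)
print(f"surrogate/victim agreement on seed prompts: {agreement:.2f}")
```

The sketch keeps the same shape as the procedure in the abstract, collect labeled pairs, upweight divergence cases, and breed new probes via mutation and crossover, while the reinforcement-learning policy and the neural surrogate of the full method are reduced here to a simple score-update rule for readability.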