As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. In contrast to prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction with safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and we reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the model's original preference for safety may shift toward maximizing scenario payoffs within the template, weakening safety constraints in specific contexts. We validate this mechanism with classical games such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the attack success rate (ASR). Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1 while maintaining efficiency. Ablations over components, decoding strategies, multilingual settings, and the Agent's core model confirm effectiveness and generalization, and scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the game mechanism fixed while varying the background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection rates under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications, and we report longitudinal safety monitoring of popular HuggingFace LLMs.
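As a minimal sketch of the quantal-response reparameterization mentioned above, the standard logit form can be written as follows; the payoff function $u$, state $s$, and rationality parameter $\lambda$ are illustrative assumptions, not the paper's exact parameterization:
\[
\pi_\lambda(a \mid s) \;=\; \frac{\exp\!\big(\lambda\, u(a, s)\big)}{\sum_{a' \in \mathcal{A}} \exp\!\big(\lambda\, u(a', s)\big)},
\]
where $u(a, s)$ denotes the LLM's effective payoff for output $a$ in game state $s$ and $\lambda \ge 0$ controls how sharply the model concentrates on payoff-maximizing responses. Under this reading, the "template-over-safety flip" corresponds to the scenario template reshaping $u$ so that scenario payoffs outweigh safety terms, shifting probability mass toward otherwise refused completions.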