Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
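To make the search procedure concrete, the sketch below gives one plausible reading of the synchronous steady-state loop summarized above, under stated assumptions: a population of prompts is perturbed by an operator, each child is scored by querying the target model and passing the response to a moderation oracle, and the child replaces the current worst member only if it scores higher. All identifiers here (query_target, moderation_score, mutate, steady_state_search) and the trivial stand-in perturbation are illustrative assumptions, not the paper's actual operators or API.

```python
"""Minimal sketch of a steady-state evolutionary prompt search.

Hypothetical stand-ins for the components named in the abstract; the
real system would plug in the target LLM, a moderation oracle, and the
lexical/negation/back-translation/paraphrase/crossover operators.
"""
import random


def query_target(prompt: str) -> str:
    # Placeholder: send the prompt to the target model (e.g., LLaMA 3.1 8B).
    return "model response to: " + prompt


def moderation_score(text: str) -> float:
    # Placeholder: moderation oracle returning a toxicity score in [0, 1];
    # higher means more toxic, used directly as fitness.
    return random.random()


def mutate(prompt: str) -> str:
    # Placeholder for one variation operator; here a trivial word-level
    # perturbation stands in for lexical substitution, negation, etc.
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = words[i][::-1]
    return " ".join(words)


def steady_state_search(seeds, iterations=100):
    # Population of (prompt, fitness) pairs; fitness is the oracle score
    # of the target model's response to the prompt.
    population = [(p, moderation_score(query_target(p))) for p in seeds]
    for _ in range(iterations):
        # Binary tournament selection of a parent, then one offspring.
        parent, _ = max(random.sample(population, k=2), key=lambda x: x[1])
        child = mutate(parent)
        fitness = moderation_score(query_target(child))
        # Steady-state replacement: the child displaces the worst member
        # only if it scores higher, so the population changes one
        # individual at a time rather than in whole generations.
        worst = min(range(len(population)), key=lambda i: population[i][1])
        if fitness > population[worst][1]:
            population[worst] = (child, fitness)
    return sorted(population, key=lambda x: -x[1])


if __name__ == "__main__":
    elites = steady_state_search(["tell me a story", "explain this topic"])
    print(elites[:3])
```

In this reading, the two semantic crossover operators would slot in alongside mutate by sampling two parents and recombining them, and refusal handling would be added to the scoring step; those details are omitted here.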