Open-weight models provide researchers and developers with accessible foundations for diverse downstream applications. We tested the safety and security postures of eight open-weight large language models (LLMs) to identify vulnerabilities that may affect subsequent fine-tuning and deployment. Using automated adversarial testing, we measured each model's resilience against single-turn and multi-turn prompt-injection and jailbreak attacks. Our findings reveal pervasive vulnerabilities across all tested models, with multi-turn attacks achieving success rates between 25.86\% and 92.78\% -- a $2\times$ to $10\times$ increase over single-turn baselines. These results underscore a systemic inability of current open-weight models to maintain safety guardrails across extended interactions. We assess that alignment strategies and lab priorities significantly influence resilience: capability-focused models such as Llama 3.3 and Qwen 3 show higher multi-turn susceptibility, whereas safety-oriented designs such as Google Gemma 3 exhibit more balanced performance. We conclude that open-weight models, while crucial for innovation, pose tangible operational and ethical risks when deployed without layered security controls. These findings are intended to alert practitioners and developers to these risks and to the value of professional AI security solutions in mitigating exposure. Addressing multi-turn vulnerabilities is essential for the safe, reliable, and responsible deployment of open-weight LLMs in enterprise and public settings, and we recommend adopting a security-first design philosophy with layered protections to ensure resilient deployments of open-weight models.
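The attack-success-rate (ASR) comparison reported above can be sketched as follows. This is a minimal illustration of the metric only; the class names and the numbers below are hypothetical, not the paper's harness or measurements:

```python
from dataclasses import dataclass


@dataclass
class AttackResult:
    """Outcome of an adversarial test campaign against one model."""
    model: str
    mode: str       # "single-turn" or "multi-turn"
    attempts: int   # total attack prompts (or conversations) issued
    successes: int  # attempts that elicited a policy-violating response

    @property
    def asr(self) -> float:
        """Attack success rate as a percentage."""
        return 100.0 * self.successes / self.attempts if self.attempts else 0.0


def multi_turn_uplift(single: AttackResult, multi: AttackResult) -> float:
    """Ratio of multi-turn ASR to single-turn ASR (the 2x-10x uplift)."""
    return multi.asr / single.asr if single.asr else float("inf")


# Illustrative numbers only:
single = AttackResult("example-model", "single-turn", attempts=500, successes=45)
multi = AttackResult("example-model", "multi-turn", attempts=500, successes=230)
print(f"single-turn ASR: {single.asr:.1f}%")                # 9.0%
print(f"multi-turn ASR:  {multi.asr:.1f}%")                 # 46.0%
print(f"uplift:          {multi_turn_uplift(single, multi):.1f}x")  # 5.1x
```

In a real evaluation the `successes` count would come from a judge applied to each model transcript; the uplift ratio is what the abstract summarizes as a $2\times$ to $10\times$ increase.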