Foundation models are becoming increasingly capable autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, given strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that cybersecurity assessments should adopt an expanded threat model, one that accounts for the varying degrees of freedom an adversary may possess in stateful and non-stateful environments under a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU-hours in our study), adversaries can improve an agent's cybersecurity capability on InterCode CTF by more than 40\% relative to the baseline -- without any external assistance. These results highlight the need to evaluate agents' cybersecurity risk dynamically, painting a more representative picture of that risk.