We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network of roughly 8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of the 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants cost $18/hour, versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.