Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.