Safety risks arise as agents based on large language models solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written natural-language policies are ambiguous and context-dependent, so they map poorly to machine-checkable rules and runtime enforcement is unreliable. We express safety policies as sequents and propose \textsc{QuadSentinel}, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic combined with an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. On ST-WebAgentBench (ICML CUA~'25) and AgentHarm (ICLR~'25), \textsc{QuadSentinel} improves guardrail accuracy and rule recall while reducing false positives, and it achieves better overall safety control than single-agent baselines such as ShieldAgent (ICML~'25). Near-term deployments can adopt this pattern without modifying core agents, since policies stay separate and machine-checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.