Safety risks arise as agents based on large language models solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written natural-language policies are ambiguous and context-dependent, so they map poorly to machine-checkable rules and runtime enforcement is unreliable. We express safety policies as sequents and propose \textsc{QuadSentinel}, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic combined with an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. On ST-WebAgentBench (ICML CUA~'25) and AgentHarm (ICLR~'25), \textsc{QuadSentinel} improves guardrail accuracy and rule recall while reducing false positives, and it achieves better overall safety control than single-agent baselines such as ShieldAgent (ICML~'25). Near-term deployments can adopt this pattern without modifying core agents, since policies stay separate and machine-checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.