Autonomous Large Language Model (LLM) agents are highly vulnerable to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, and they exploit the fundamental trade-off between security and functionality in existing defense mechanisms. The result is malicious, unauthorized tool invocations that divert agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: although current defenses are partially effective, most defense architectures are inherently fragmented. Consequently, they cannot assure integrity across the entire task execution pipeline, and they force unacceptable compromises among security, functionality, and efficiency. Our method rests on a core insight: no matter how subtle an IPI attack is, its pursuit of a malicious objective will ultimately manifest as a detectable deviation of the action trajectory from the expected legitimate plan. Based on this insight, we propose the Cognitive Control Architecture (CCA), a holistic framework that provides full-lifecycle cognitive supervision. CCA builds an efficient, dual-layered defense on two synergistic pillars: (i) proactive enforcement of control-flow and data-flow integrity via a pre-generated "Intent Graph"; and (ii) a "Tiered Adjudicator" that, upon detecting a deviation, initiates deep reasoning based on multi-dimensional scoring and is specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark show that CCA not only withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.
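The two pillars described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name in it (`ToolCall`, `IntentGraph`, `adjudicate`, `supervise`, the tool names, and the two toy scorers) is hypothetical, and the adjudicator's deep reasoning is stood in for by an unweighted mean of simple heuristic scores. The sketch only shows the general pattern of checking each tool call against a pre-generated plan and escalating deviations to a second tier.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class IntentGraph:
    """Hypothetical plan: allowed tool transitions plus argument constraints."""
    edges: dict                                        # tool -> set of tools that may follow
    allowed_args: dict = field(default_factory=dict)   # tool -> {arg name -> allowed values}

    def deviates(self, prev_tool: str, call: ToolCall) -> bool:
        # Control-flow check: is this transition part of the plan?
        if call.tool not in self.edges.get(prev_tool, set()):
            return True
        # Data-flow check: do constrained arguments carry planned values?
        constraints = self.allowed_args.get(call.tool, {})
        return any(call.args.get(k) not in ok for k, ok in constraints.items())


Scorer = Callable[[ToolCall, str], float]


def adjudicate(call: ToolCall, user_task: str, scorers: list, threshold: float = 0.5) -> bool:
    """Second tier (assumed form): mean of per-dimension scores in [0, 1]."""
    scores = [score(call, user_task) for score in scorers]
    return sum(scores) / len(scores) >= threshold


def supervise(trajectory: list, graph: IntentGraph, user_task: str, scorers: list) -> list:
    decisions, prev = [], "START"
    for call in trajectory:
        if not graph.deviates(prev, call):
            decisions.append("allow")
        elif adjudicate(call, user_task, scorers):
            decisions.append("allow_after_review")
        else:
            decisions.append("block")
            break  # stop executing once a call is rejected
        prev = call.tool
    return decisions


# Toy scoring dimensions, assumptions for illustration only.
READ_ONLY_TOOLS = {"read_inbox"}


def recipient_named_in_task(call: ToolCall, task: str) -> float:
    recipient = call.args.get("recipient")
    return 1.0 if recipient is None or recipient in task else 0.0


def tool_is_read_only(call: ToolCall, task: str) -> float:
    return 1.0 if call.tool in READ_ONLY_TOOLS else 0.0


task = "Summarize my inbox and email the summary to alice@example.com"
graph = IntentGraph(
    edges={"START": {"read_inbox"}, "read_inbox": {"send_email"}},
    allowed_args={"send_email": {"recipient": {"alice@example.com"}}},
)
scorers = [recipient_named_in_task, tool_is_read_only]

benign = [ToolCall("read_inbox"), ToolCall("send_email", {"recipient": "alice@example.com"})]
hijacked = [ToolCall("read_inbox"), ToolCall("send_email", {"recipient": "attacker@evil.example"})]
retry = [ToolCall("read_inbox"), ToolCall("read_inbox"),
         ToolCall("send_email", {"recipient": "alice@example.com"})]

print(supervise(benign, graph, task, scorers))
print(supervise(hijacked, graph, task, scorers))
print(supervise(retry, graph, task, scorers))
```

In this toy setup the injected recipient fails the data-flow check and both scoring dimensions, so the call is blocked, whereas a harmless repeated read deviates from the plan but passes review. Separating a cheap graph check from a costlier adjudication step is what lets conforming calls proceed without extra reasoning.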