Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains carries significant safety and ethical risks: unethical behavior by these agents can directly cause serious real-world consequences, including physical harm and financial loss. To steer the ethical behavior of agents efficiently, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological theories of morality. The benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing scenarios of greater complexity and ambiguity. We first show that Behavior Editing can dynamically steer agents toward a target behavior within specific scenarios. Beyond such scenario-specific local adjustments, Behavior Editing also enables broader shifts in an agent's global moral alignment. We further demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of Behavior Editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and the perils of Behavior Editing.
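The abstract does not commit to a particular editing algorithm, so the following is only a minimal sketch of the general idea, assuming a ROME-style rank-one weight update, one established model-editing technique. The function `rank_one_edit` and the tensors `weight`, `key`, and `new_value` are hypothetical names for illustration, not this paper's method or API; in an actual Behavior Editing setup, the key would be derived from activations associated with a target scenario rather than sampled at random.

```python
import torch

def rank_one_edit(weight: torch.Tensor,
                  key: torch.Tensor,
                  new_value: torch.Tensor) -> torch.Tensor:
    """Return an edited copy of `weight` so that weight' @ key == new_value.

    The update is rank-one, so inputs orthogonal to `key` are unaffected --
    the kind of locality that lets model editing change a targeted behavior
    while preserving the model's overall capabilities.
    """
    key = key / key.norm()                            # unit-norm key direction
    old_value = weight @ key                          # layer's current response
    delta = torch.outer(new_value - old_value, key)   # rank-one correction
    return weight + delta

# Toy usage: edit a 4x4 linear layer so one input direction maps to a new target.
torch.manual_seed(0)
W = torch.randn(4, 4)
k = torch.randn(4)       # stand-in for an activation pattern tied to a scenario
v_new = torch.randn(4)   # stand-in for the desired "target behavior" output

W_edited = rank_one_edit(W, k, v_new)
print(torch.allclose(W_edited @ (k / k.norm()), v_new, atol=1e-6))  # True: edited

k_orth = torch.tensor([k[1], -k[0], 0.0, 0.0])        # direction orthogonal to k
print(torch.allclose(W_edited @ k_orth, W @ k_orth))  # True: locality preserved
```

The two checks mirror the two properties the abstract attributes to model editing: the edited key now yields the new value (the behavior changes in the targeted scenario), while orthogonal inputs pass through unchanged (overall capabilities are preserved).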