Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as ``Deceitful'' and ``Manipulative'', often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.