As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is crucial to ensure that their behavior aligns with human values, societal norms, and ethical principles. However, safety alignment via Reinforcement Learning (RL) often causes the model to forget previously learned general abilities, a phenomenon known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework that aligns LLMs for safety while preserving their core abilities. NSPO geometrically projects the safety policy gradients into the null space of the general-task gradients, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms existing methods by a large margin, achieving state-of-the-art safety performance without sacrificing accuracy on general tasks, including math, code, and instruction following. Notably, NSPO is data-efficient: it requires only 40% of the public human-annotated safety data from PKU-SafeRLHF to achieve strong safety performance, without the large amounts of mixed general-task data required by existing alignment methods.
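To make the null-space projection concrete, the following is a minimal sketch (not the authors' implementation) of projecting a safety gradient into the null space of general-task gradients. The names `G` (a matrix whose rows are flattened general-task gradients), `g_safe`, and `project_to_null_space` are hypothetical and introduced only for illustration.

```python
# Minimal sketch of null-space gradient projection (illustrative only).
import numpy as np

def project_to_null_space(g_safe: np.ndarray, G: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Project g_safe onto the null space of the general-task gradient matrix G.

    The projected update is orthogonal to every row of G, so a first-order
    step along it leaves the general-task losses unchanged.
    """
    # Orthonormal basis of the row space of G via SVD.
    _, s, vt = np.linalg.svd(G, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    V = vt[:rank]                        # (rank, d) row-space basis
    # Subtract the row-space component: g - V^T V g.
    return g_safe - V.T @ (V @ g_safe)

# Toy usage: two general-task gradients and one safety gradient in R^5.
rng = np.random.default_rng(0)
G = rng.normal(size=(2, 5))
g_safe = rng.normal(size=5)
g_proj = project_to_null_space(g_safe, G)
print(np.allclose(G @ g_proj, 0.0))      # True: no first-order interference
```

Intuitively, because the projected gradient has zero inner product with every general-task gradient, a small safety update along it does not increase the general-task losses to first order, which is the geometric basis of the capability-preservation claim above.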