Alignment is a key step in the development of Large Language Models (LLMs), using human feedback to ensure adherence to human values and societal norms. This dependence on human feedback raises privacy concerns about how much a labeler's preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches such as Differentially Private SGD (DP-SGD) provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment, but because human preferences are tied only to the labels of (prompt, response) pairs, they can provide more privacy than necessary and degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of the preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy-preserving alignment framework in which models privately aligned in earlier stages serve as labelers that supplement the training data of subsequent alignment stages. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing strong privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win rates compared to DP-SGD, and 2.5x higher win rates compared to Randomized Response (RR) based alignment.
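To make the multi-stage structure concrete, the following is a minimal sketch of the progressive self-labeling idea described above, under simplifying assumptions: binary preference labels, randomized response as the preference-level privacy mechanism for human labels, and toy placeholder functions (`train_on_preferences`, `label_with_model`, and the per-stage data split are all hypothetical, not the paper's actual implementation).

```python
# Sketch only: illustrates the staging pattern, not the paper's exact algorithm.
import math
import random


def randomized_response(label: int, epsilon: float) -> int:
    """Keep a binary preference label with probability e^eps / (1 + e^eps),
    otherwise flip it (epsilon-DP randomized response on the label)."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if random.random() < keep_prob else 1 - label


def train_on_preferences(model, batch):
    """Placeholder for a preference-optimization step (e.g., DPO) on labeled pairs."""
    return {"base": model, "num_examples": len(batch)}


def label_with_model(model, pair):
    """Placeholder: the model aligned in the previous stage labels a new
    (prompt, response_a, response_b) pair; here a random stand-in."""
    return random.randint(0, 1)


def props(initial_model, stages, epsilon):
    """Progressive private self-alignment over multiple stages.

    Each stage privatizes its human preference labels with randomized response;
    from the second stage onward, the model aligned in the previous stage
    supplements the training data by labeling additional pairs. The per-stage
    data split here is an illustrative assumption.
    """
    model = initial_model
    for stage_idx, (human_data, unlabeled_pairs) in enumerate(stages):
        batch = [(pair, randomized_response(y, epsilon)) for pair, y in human_data]
        if stage_idx > 0:
            # Model-generated labels: no additional privacy cost for humans.
            batch += [(pair, label_with_model(model, pair)) for pair in unlabeled_pairs]
        model = train_on_preferences(model, batch)
    return model


if __name__ == "__main__":
    # Toy usage with dummy (prompt, response_a, response_b) tuples.
    human_stage_1 = [(("p1", "a", "b"), 1), (("p2", "a", "b"), 0)]
    human_stage_2 = [(("p3", "a", "b"), 1)]
    extra_pairs_2 = [("p4", "a", "b"), ("p5", "a", "b")]
    aligned = props(
        initial_model="base-policy",
        stages=[(human_stage_1, []), (human_stage_2, extra_pairs_2)],
        epsilon=1.0,
    )
```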