What Matters in Data for DPO?

Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: which characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how the preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals that contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show that it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing in on-policy data. Our results explain the mechanisms behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.
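
For readers who want the DPO objective in concrete form, the following is a minimal sketch of the standard per-pair DPO loss, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the chosen response and y_l the rejected one. It is not taken from this work: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities that have already been computed.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (illustrative sketch).

    logp_*     : log-probability of the response under the policy being trained
    ref_logp_* : log-probability of the same response under the frozen reference model
    beta       : scaling factor on the log-ratio margin (0.1 is an assumed default)
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2. Raising the policy's log-probability of the chosen response, or lowering that of the rejected response, reduces the loss.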
