Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled labor. While some training data is publicly available, producing proprietary datasets, such as human preference annotations, or curating new ones from existing sources requires substantial investment. Since larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and investments in data sources? Second, how can multiple data owners pool their resources to collaboratively train better models while fairly sharing the benefits? This problem, data valuation, is not specific to large language models and has been addressed by the machine learning community through the lens of cooperative game theory, with the Shapley value as the prevalent solution concept. However, computing Shapley values for data valuation is notoriously expensive, typically requiring numerous model retrainings, which becomes prohibitive for large machine learning models. In this work, we demonstrate that this computational challenge is dramatically simplified for LLMs trained with Direct Preference Optimization (DPO). We show how the specific mathematical structure of DPO enables scalable Shapley value computation. We believe this observation unlocks many applications at the intersection of data valuation and large language models.
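For reference, a minimal sketch of the two standard objects the abstract refers to, written in common notation that is not taken from this work: the Shapley value of a data owner $i$ in a cooperative game $(N, v)$, and the DPO objective with reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$.

\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\bigl[\,v(S \cup \{i\}) - v(S)\,\bigr],
\]
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].
\]

In the standard data-valuation setting, the utility $v(S)$ would be the performance obtained by training on the pooled data of coalition $S \subseteq N$, so the naive Shapley computation requires one retraining per coalition; this is the cost the abstract describes as prohibitive for large models.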