High-quality data is needed to unlock the full potential of AI for end users. However, finding new sources of such data is getting harder: most publicly available human-generated data will soon have been used. Additionally, publicly available data is often not representative of the users of a particular system. For example, a research speech dataset of contractors interacting with an AI assistant will likely be more homogeneous, well-articulated and self-censored than the real-world commands that end users issue. Therefore, unlocking high-quality data grounded in real user interactions is of vital interest. However, the direct use of user data comes with significant privacy risks. Differential Privacy (DP) is a well-established framework for reasoning about and limiting information leakage, and is a gold standard for protecting user privacy. The focus of this work, \emph{Differentially Private Synthetic data}, refers to synthetic data that preserves the overall trends of the source data while providing strong privacy guarantees to the individuals who contributed to the source dataset. DP synthetic data can unlock the value of datasets that have previously been inaccessible due to privacy concerns, and can replace the use of sensitive datasets that previously had only rudimentary protections such as ad-hoc rule-based anonymization. In this paper we explore the full suite of techniques surrounding DP synthetic data, the types of privacy protections they offer, and the state of the art for various modalities (image, tabular, text and decentralized). We outline all the components needed in a system that generates DP synthetic data, from sensitive data handling and preparation to usage tracking and empirical privacy testing. We hope that this work will result in increased adoption of DP synthetic data, spur additional research, and increase trust in DP synthetic data approaches.