Dataset distillation (DD) compresses large datasets into smaller ones while preserving the performance of models trained on them. Although DD is often assumed to enhance data privacy by aggregating over individual examples, recent studies reveal that standard DD can still leak sensitive information from the original dataset due to the lack of formal privacy guarantees. Existing differentially private DD (DP-DD) methods attempt to mitigate this risk by injecting noise into the distillation process. However, they often fail to fully leverage the original dataset, resulting in degraded realism and utility. This paper introduces \libn, a novel framework that addresses the key limitations of current DP-DD methods by leveraging DP-generated data. Specifically, \lib initializes the distilled dataset with DP-generated data to enhance realism. Then, the generated data refines the DP feature-matching technique, distilling the original dataset under a small privacy budget, and is used to train an expert model that aligns the distilled examples with their class distributions. Furthermore, we design a privacy budget allocation strategy that determines budget consumption across the DP components, and we provide a theoretical analysis of the overall privacy guarantee. Extensive experiments show that \lib significantly outperforms state-of-the-art DP-DD methods in terms of both dataset utility and robustness against membership inference attacks, establishing a new paradigm for privacy-preserving dataset distillation.
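As a rough, illustrative sketch only (the paper's actual allocation and accounting may be tighter, e.g., via advanced or Rényi-DP composition), suppose the DP data generation, DP feature matching, and expert-model components consume budgets $(\varepsilon_{\mathrm{gen}},\delta_{\mathrm{gen}})$, $(\varepsilon_{\mathrm{match}},\delta_{\mathrm{match}})$, and $(\varepsilon_{\mathrm{expert}},\delta_{\mathrm{expert}})$; these symbols are placeholders introduced here, not notation from the paper. Basic sequential composition then bounds the end-to-end release by
% Illustrative composition bound; symbols are placeholders, not the paper's notation.
\[
  \bigl(\varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{match}} + \varepsilon_{\mathrm{expert}},\;
        \delta_{\mathrm{gen}} + \delta_{\mathrm{match}} + \delta_{\mathrm{expert}}\bigr)\text{-DP},
\]
so a budget allocation strategy amounts to splitting a target $(\varepsilon,\delta)$ across the three DP components.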