Prevailing Dataset Distillation (DD) methods that leverage generative models face two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models and thus avoid such target-specific training, they suffer from a significant distributional mismatch: the web-scale priors encapsulated in these foundation models fail to faithfully capture target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset via a robust density-based discovery mechanism, and then steer the generative process so that generated samples align with this core distribution. In doing so, CoDA bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments show that, without relying on a generative model trained specifically on the target dataset, CoDA matches or surpasses previous methods that do rely on one, across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% under the 50 images-per-class (IPC) setting on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
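The abstract does not specify how the density-based discovery mechanism works; the following is a minimal hypothetical sketch, assuming the core distribution is identified by kernel density estimation over per-class feature embeddings and keeping the densest samples. The function name `find_core_samples`, the Gaussian-KDE choice, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical illustration of density-based core discovery (NOT CoDA's actual code).
# Given feature embeddings of one class, keep the highest-density fraction as the
# "core" set that generation would subsequently be aligned to.
import numpy as np
from sklearn.neighbors import KernelDensity

def find_core_samples(features: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Return indices of the densest `keep_ratio` fraction of `features`.

    features: (N, D) array of per-class feature embeddings.
    """
    kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(features)
    log_density = kde.score_samples(features)   # per-sample log-density estimate
    k = max(1, int(keep_ratio * len(features)))
    return np.argsort(log_density)[-k:]         # indices of the densest samples

# Usage (hypothetical): feats = encoder(images_of_one_class)
#                       core_idx = find_core_samples(feats, keep_ratio=0.5)
```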