CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation

Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
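The abstract gives no details of the "density-based discovery mechanism", so the following is only a minimal sketch of the general idea, not CoDA's actual procedure. It assumes per-class image features are already available as a NumPy array and scores each sample by its mean distance to its k nearest neighbours, keeping the densest fraction as a stand-in for a "core". The function names `knn_density_core` and `core_centroid`, the parameters `k` and `keep_ratio`, and the synthetic data are all hypothetical.

```python
import numpy as np


def knn_density_core(features, k=5, keep_ratio=0.5):
    """Return indices of the densest samples (illustrative, not CoDA's method).

    Density is approximated by the mean Euclidean distance to the k nearest
    neighbours: a smaller mean distance means a denser neighbourhood.
    """
    n = features.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances, shape (n, n).
    diff = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the self-distance (0), so skip it.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    score = knn.mean(axis=1)
    n_keep = max(1, int(round(keep_ratio * n)))
    # Ascending sort: the lowest scores (densest samples) come first.
    return np.argsort(score)[:n_keep]


def core_centroid(features, core_idx):
    """Mean feature vector of the selected core samples."""
    return features[core_idx].mean(axis=0)


# Synthetic stand-in for one class: a tight cluster plus scattered outliers.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(40, 8))
outliers = rng.uniform(3.0, 6.0, size=(4, 8))
feats = np.concatenate([cluster, outliers], axis=0)

core_idx = knn_density_core(feats, k=5, keep_ratio=0.5)
centroid = core_centroid(feats, core_idx)
```

In this toy setup the selected indices all fall inside the tight cluster, so the scattered outliers do not influence the centroid. How such a core would then be used to steer a text-to-image model is not specified in the abstract and is left out of the sketch.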

