Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with full-data training. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets, where it outperforms several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections of 10% of the training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.
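To make the idea of scoring candidates by coverage and redundancy over many subspaces more concrete, the following is a minimal illustrative sketch and not the released ZCore implementation. It assumes the embeddings are already available as an `(n, d)` NumPy array; here random vectors stand in for foundation-model outputs. The function names `subspace_scores` and `select_coreset`, the parameters `n_iters`, `subspace_dim`, and `k_redundant`, and the specific scoring rule (credit the candidate nearest to a random probe point, penalize that candidate's nearest neighbors) are all assumptions made for illustration.

```python
import numpy as np


def subspace_scores(embeddings, n_iters=2000, subspace_dim=2, k_redundant=5, seed=0):
    """Score candidates by coverage minus redundancy over random subspaces.

    Hypothetical scoring rule for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    scores = np.zeros(n)
    for _ in range(n_iters):
        # Pick a random low-dimensional subspace (a subset of embedding axes).
        dims = rng.choice(d, size=subspace_dim, replace=False)
        sub = embeddings[:, dims]
        # Draw a random probe location inside the subspace's bounding box.
        probe = rng.uniform(sub.min(axis=0), sub.max(axis=0))
        # Coverage: the candidate closest to the probe gains credit.
        winner = np.argmin(np.linalg.norm(sub - probe, axis=1))
        scores[winner] += 1.0
        # Redundancy: the winner's nearest neighbours lose credit.
        dist_to_winner = np.linalg.norm(sub - sub[winner], axis=1)
        nearest = np.argsort(dist_to_winner)[: k_redundant + 1]
        neighbours = nearest[nearest != winner][:k_redundant]
        scores[neighbours] -= 1.0 / k_redundant
    return scores


def select_coreset(embeddings, budget, **kwargs):
    """Return the indices of the `budget` highest-scoring candidates."""
    scores = subspace_scores(embeddings, **kwargs)
    return np.argsort(scores)[::-1][:budget]


if __name__ == "__main__":
    # Random vectors stand in for zero-shot foundation-model embeddings.
    demo = np.random.default_rng(1).normal(size=(200, 16))
    coreset = select_coreset(demo, budget=20)  # a 10% data budget
    print(sorted(coreset.tolist()))
```

Because the score is computed once for every candidate, the same ranking can serve any data budget by changing only `budget`; no labels and no training on the candidate data are involved at any step of this sketch.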