Core-Set Selection for Data-efficient Land Cover Segmentation

Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hansch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakus, Paul L. Rosin

The increasing accessibility of remotely sensed data and their potential to support large-scale decision-making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. To this end, we introduce six basic core-set selection approaches, which rely on imagery only, labels only, or a combination of both, and investigate whether they can identify high-quality subsets of data that maintain, or even surpass, the performance achieved with full datasets for remote sensing semantic segmentation. We benchmark these approaches against two traditional baselines on three widely used land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U-Net), thus establishing a general baseline for future work. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
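To make the idea of a label-only selection concrete, the following is a minimal sketch under stated assumptions, not the method from the paper or its repository. It assumes each training sample has a label mask given as a nested list of integer class ids, and it ranks samples by the Shannon entropy of their class distribution, keeping the top fraction. The entropy criterion and the function names `class_entropy`, `select_by_label_entropy`, and `select_random` are hypothetical choices made for illustration; the abstract does not specify the six approaches or the two baselines.

```python
import math
import random
from collections import Counter


def class_entropy(mask):
    """Shannon entropy (in nats) of the class distribution in one label mask."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def subset_size(n, fraction):
    """Number of samples to keep; always at least one."""
    return max(1, round(fraction * n))


def select_by_label_entropy(masks, fraction):
    """Assumed heuristic: keep the samples whose masks mix the most classes."""
    scores = [class_entropy(m) for m in masks]
    # Stable sort, highest entropy first; ties keep their original order.
    order = sorted(range(len(masks)), key=lambda i: -scores[i])
    return sorted(order[:subset_size(len(masks), fraction)])


def select_random(n, fraction, seed=0):
    """Uniform random subset of the same size, as a simple point of comparison."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), subset_size(n, fraction)))
```

A subset chosen this way would then be used to train the segmentation model in place of the full training set, for example with `fraction=0.25` to mirror the 25% setting mentioned above. Whether such a heuristic helps on real data is an empirical question that this sketch does not answer.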
