The increasing accessibility of remotely sensed data and their potential to support large-scale decision-making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. To this end, we introduce six basic core-set selection approaches, which rely on imagery only, labels only, or a combination of both, and investigate whether they can identify high-quality subsets of data capable of matching, or even surpassing, the performance achieved with full datasets for remote sensing semantic segmentation. We benchmark these approaches against two traditional baselines on three widely used land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U-Net), thus establishing a general baseline for future work. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result highlights the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.