Co-Training Vision Language Models for Remote Sensing Multi-task Learning

With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, a unified model that excels across multiple tasks through multi-task learning (MTL) is coming within reach. Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, their unified text-based interface shows significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we build a data curation engine that covers data acquisition, offline processing and integration, and online loading and weighting. This data engine handles the complex RS data environment and generates flexible vision-language conversations. Second, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. These strategies are flexible and effectively reduce the computational burden. Third, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
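The "online loading and weighting" step mentioned above can be illustrated with a minimal sketch. This is not the RSCoVLM data engine: the task names, the toy samples, the weights, and the function `weighted_task_stream` are all hypothetical, and the sketch only assumes that each training sample is drawn from a task chosen in proportion to a per-task weight.

```python
import random
from collections import Counter


def weighted_task_stream(datasets, weights, num_samples, seed=0):
    """Yield (task, sample) pairs, choosing the task in proportion to its weight.

    Hypothetical helper for illustration; not part of any released tooling.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    for _ in range(num_samples):
        # Pick a task according to the mixing weights, then a sample from it.
        task = rng.choices(names, weights=probs, k=1)[0]
        yield task, rng.choice(datasets[task])


# Toy stand-ins for per-task conversation datasets (assumed, not from the paper).
datasets = {
    "captioning": ["cap_0", "cap_1"],
    "grounding": ["grd_0", "grd_1", "grd_2"],
    "detection": ["det_0"],
}
# Assumed mixing weights: grounding is sampled twice as often as the others.
weights = {"captioning": 1.0, "grounding": 2.0, "detection": 1.0}

stream = list(weighted_task_stream(datasets, weights, num_samples=1000))
counts = Counter(task for task, _ in stream)
print(counts)
```

Changing the weights at load time rebalances the task mixture without rebuilding the offline data, which is one plausible reason to separate offline integration from online weighting.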


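Multi-task learning

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time while exploiting the commonalities and differences across tasks. Compared with training a separate model per task, this can improve learning efficiency and prediction accuracy for the task-specific models. MTL is an approach to inductive transfer: it improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does so by learning tasks in parallel with a shared representation, so that what is learned for each task can help the other tasks be learned better.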
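The shared-representation idea can be sketched with a deliberately tiny model. Everything here is an illustrative assumption rather than anything from the paper: a single shared weight `w` plays the role of the shared encoder, two scalar heads `a` and `b` serve two toy regression tasks, and the summed squared error is minimized with plain gradient descent.

```python
def train_shared_model(xs, ys_a, ys_b, lr=0.01, epochs=500):
    """Train one shared weight and two task heads on a summed loss.

    Hypothetical toy model: h = w * x is the shared representation,
    task A predicts a * h and task B predicts b * h.
    """
    w, a, b = 0.1, 0.1, 0.1  # shared weight, head A, head B
    n = len(xs)
    history = []
    for _ in range(epochs):
        grad_w = grad_a = grad_b = 0.0
        loss = 0.0
        for x, ya, yb in zip(xs, ys_a, ys_b):
            h = w * x                      # shared representation
            err_a = a * h - ya             # task A residual
            err_b = b * h - yb             # task B residual
            loss += (err_a ** 2 + err_b ** 2) / n
            grad_a += 2 * err_a * h / n
            grad_b += 2 * err_b * h / n
            # The shared weight receives gradient from both tasks.
            grad_w += 2 * (err_a * a + err_b * b) * x / n
        history.append(loss)
        w -= lr * grad_w
        a -= lr * grad_a
        b -= lr * grad_b
    return (w, a, b), history


xs = [1.0, 2.0, 3.0]
ys_a = [2.0 * x for x in xs]   # toy task A: y = 2x
ys_b = [-1.0 * x for x in xs]  # toy task B: y = -x

(w, a, b), history = train_shared_model(xs, ys_a, ys_b)
print(f"initial loss {history[0]:.3f}, final loss {history[-1]:.6f}")
```

The line that accumulates `grad_w` is the point of the sketch: the shared parameter is updated by the training signal of both tasks, while each head only sees its own.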
