Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the \textquote{semantic gap}: the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks for evaluating the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval method called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach entirely bypasses model training and fine-tuning. Experiments on the RSITMD and RSICD benchmarks show that our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62\%, nearly doubling the 23.86\% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
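To make the text-to-text (T2T) reformulation concrete, the sketch below illustrates the general idea under stated assumptions: it is not the paper's implementation. It assumes an off-the-shelf sentence-embedding encoder (here \texttt{sentence-transformers} with the \texttt{all-MiniLM-L6-v2} model, chosen purely for illustration) and a set of pre-generated VLM captions standing in for the image database; both the captions and image identifiers are placeholders.

\begin{verbatim}
# Minimal sketch of T2T retrieval over VLM-generated captions.
# Assumptions (not from the paper): sentence-transformers as the text
# encoder and placeholder captions/image ids for the database.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder works

# One VLM-generated caption per database image (illustrative examples).
image_ids = ["img_001", "img_002", "img_003"]
captions = [
    "A dense residential area with red-roofed buildings along a road.",
    "An airport runway with two parked airplanes and grassland around it.",
    "A harbor with several boats docked next to a long pier.",
]

# Embed captions once; unit-normalized vectors make dot product = cosine.
caption_emb = encoder.encode(captions, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """Rank images by similarity between the rich-text query and their
    VLM-generated captions in the shared text embedding space."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = caption_emb @ q
    order = np.argsort(-scores)[:top_k]
    return [(image_ids[i], float(scores[i])) for i in order]

print(retrieve("planes waiting on a runway near the terminal"))
\end{verbatim}

Because the database images are represented only by their captions, retrieval quality hinges on caption richness and the strength of the text encoder, which is precisely the property the RSRT benchmark is designed to measure.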