Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, and therefore often fail to model the explicit spatial correspondences that are crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, which explicitly bridges the semantic and geometric gaps between views. CLNet decomposes the view-alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. Together, these modules allow CLNet to capture both high-level semantics and fine-grained spatial alignment. Extensive experiments on four public benchmarks (CVUSA, CVACT, VIGOR, and University-1652) demonstrate that CLNet achieves state-of-the-art performance while offering improved interpretability and generalization.
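To make the three-module decomposition concrete, the following is a minimal PyTorch sketch of how NCM, NEC, and GFR might compose. The module names and their roles come from the abstract; every implementation detail (flow-field warping via `grid_sample`, the MLP width, squeeze-and-excitation-style channel gating, and the average-pooled retrieval descriptor) is an illustrative assumption, not the paper's actual architecture.

```python
# Illustrative sketch only: module names (NCM, NEC, GFR, CLNet) follow the
# abstract, but all implementation details below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralCorrespondenceMap(nn.Module):
    """Assumed NCM: predicts a latent correspondence (offset) field and warps
    street-view features toward the satellite-view grid."""
    def __init__(self, channels):
        super().__init__()
        # A 2-channel offset field predicted from the concatenated feature pair.
        self.flow_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, feat_sat, feat_street):
        flow = self.flow_head(torch.cat([feat_sat, feat_street], dim=1))
        b, _, h, w = flow.shape
        # Base sampling grid in [-1, 1], displaced by the predicted offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=flow.device),
            torch.linspace(-1, 1, w, device=flow.device),
            indexing="ij",
        )
        grid = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)
        grid = grid + flow.permute(0, 2, 3, 1)
        return F.grid_sample(feat_street, grid, align_corners=False)

class NonlinearEmbeddingConverter(nn.Module):
    """Assumed NEC: per-location MLP that remaps features across perspectives."""
    def __init__(self, channels, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, feat):
        # Apply the MLP over the channel dimension at every spatial location.
        x = feat.permute(0, 2, 3, 1)              # B, H, W, C
        return self.mlp(x).permute(0, 3, 1, 2)    # back to B, C, H, W

class GlobalFeatureRecalibration(nn.Module):
    """Assumed GFR: channel reweighting driven by globally pooled spatial cues
    (squeeze-and-excitation style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, feat):
        weights = self.gate(feat.mean(dim=(2, 3)))    # B, C channel gates
        return feat * weights[:, :, None, None]

class CLNet(nn.Module):
    """Assumed composition of the three modules on backbone features."""
    def __init__(self, channels=512):
        super().__init__()
        self.ncm = NeuralCorrespondenceMap(channels)
        self.nec = NonlinearEmbeddingConverter(channels)
        self.gfr = GlobalFeatureRecalibration(channels)

    def forward(self, feat_sat, feat_street):
        aligned = self.ncm(feat_sat, feat_street)   # spatial alignment
        remapped = self.nec(aligned)                # cross-view remapping
        refined = self.gfr(remapped)                # channel recalibration
        # Global descriptor for retrieval (assumed: average pooling + L2 norm).
        return F.normalize(refined.mean(dim=(2, 3)), dim=1)
```

In this reading, NCM handles geometry (where features correspond), NEC handles semantics (how their embeddings relate), and GFR decides which channels matter for the final descriptor; a retrieval loss on the L2-normalized outputs would train all three jointly.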