The high cost of acquiring rural street view images has constrained comprehensive environmental perception in rural areas. Drone photographs, which are easy to acquire, cover broad areas, and offer high spatial resolution, provide a viable basis for large-scale rural environmental perception. However, a systematic methodology for identifying key environmental elements from drone photographs and quantifying their impact on environmental perception is still lacking. To address this gap, a Vision-Language Contrastive Ranking Framework (VLCR) is designed for rural livability assessment in China. The framework uses chain-of-thought prompting to guide multimodal large language models (MLLMs) in identifying visual features related to quality of life and ecological habitability from drone photographs. To address the instability of pairwise village comparisons, a drone photograph comparison strategy constrained by text descriptions is proposed. Finally, to overcome the efficiency bottleneck of nationwide pairwise village comparisons, an innovative ranking algorithm based on binary search interpolation is developed, which reduces the number of comparisons by automatically selecting comparison targets. The proposed framework achieves a Spearman Footrule distance of 0.74, outperforming mainstream commercial MLLMs by approximately 0.1. Moreover, performing comparison and ranking concurrently yields a threefold improvement in computational efficiency. The framework contributes both a new data source and a new methodology for village livability assessment, supporting large-scale village livability analysis.

Keywords: Drone photographs, Environmental perception, Rural livability assessment, Multimodal large language models, Chain-of-thought prompting.
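The abstract does not spell out the ranking algorithm, so the following is only a minimal sketch of the general idea, assuming that "binary search interpolation" behaves like binary insertion: each new village is placed into an already ranked list by binary search, so the comparison targets are chosen automatically. The names `rank_by_binary_insertion` and `prefer` are hypothetical; `prefer(a, b)` stands in for a pairwise judgment, which in the framework's setting would presumably be an MLLM comparison of two villages.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rank_by_binary_insertion(
    items: Sequence[T],
    prefer: Callable[[T, T], bool],
) -> Tuple[List[T], int]:
    """Rank items best-first by inserting each one via binary search.

    prefer(a, b) is a hypothetical pairwise judge that returns True when
    a should rank above b. Returns the ranked list and the number of
    pairwise comparisons that were made.
    """
    ranked: List[T] = []
    comparisons = 0
    for item in items:
        lo, hi = 0, len(ranked)
        # Narrow the insertion position; the midpoint of the remaining
        # range is the automatically selected comparison target.
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if prefer(item, ranked[mid]):
                hi = mid  # item belongs above ranked[mid]
            else:
                lo = mid + 1  # item belongs below ranked[mid]
        ranked.insert(lo, item)
    return ranked, comparisons


if __name__ == "__main__":
    # Toy livability scores used only to simulate the pairwise judge.
    scores = {"A": 0.2, "B": 0.9, "C": 0.5, "D": 0.7}
    ranked, used = rank_by_binary_insertion(
        list(scores), lambda a, b: scores[a] > scores[b]
    )
    all_pairs = len(scores) * (len(scores) - 1) // 2
    print(ranked, used, all_pairs)
```

Binary insertion needs on the order of n log n comparisons for n items, whereas comparing every pair needs n(n-1)/2, which is why this kind of scheme helps when each comparison is an expensive model call. The sketch assumes the judge is consistent; handling unstable or contradictory judgments is outside its scope.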