Wastewater treatment plants (WWTPs) are in high demand across the Middle East and North Africa (MENA), where they are crucial for sustainable water management. Identifying WWTPs precisely in satellite imagery enables environmental monitoring. Traditional approaches such as YOLOv8 segmentation require extensive manual labeling, whereas studies indicate that vision-language models (VLMs) offer an efficient alternative that achieves equivalent or superior results through inherent reasoning and annotation capabilities. This study presents a structured methodology for comparing VLMs on WWTP identification, divided into zero-shot and few-shot streams. The YOLOv8 baseline was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and the UAE, of which ~85% are WWTPs (positives) and 15% are non-WWTPs (negatives). The evaluated VLMs are LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral). Expert prompts direct each model to identify WWTP components such as circular and rectangular tanks and aeration basins, to distinguish confounders, and to return JSON output containing a confidence value and a description. The dataset comprises 1,207 validated WWTP locations (198 in the UAE, 354 in Saudi Arabia, and 655 in Egypt) and an equal number of non-WWTP sites drawn from field and AI data, stored as 600 m × 600 m GeoTIFF images (zoom level 18, EPSG:4326). In zero-shot evaluation on WWTP images, several VLMs outperformed YOLOv8's true positive rate, with Gemma 3 scoring highest. The results confirm that VLMs, particularly in the zero-shot setting, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
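As an illustration of how JSON replies of this kind could be turned into a true positive rate, the following is a minimal sketch and not the study's actual pipeline. The field names `is_wwtp`, `confidence`, and `description`, the 0.5 decision threshold, and the choice to count unparseable replies as misses are all assumptions made for the example.

```python
import json

# Hypothetical reply schema: {"is_wwtp": bool, "confidence": float, "description": str}.
# The field names are assumptions; the abstract only states that replies are JSON
# containing a confidence value and a description.
EMPTY = {"is_wwtp": False, "confidence": 0.0, "description": ""}


def parse_vlm_response(raw: str) -> dict:
    """Parse one VLM reply, falling back to a negative result if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(EMPTY)
    if not isinstance(data, dict):
        return dict(EMPTY)
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    return {
        "is_wwtp": data.get("is_wwtp") is True,
        "confidence": confidence,
        "description": str(data.get("description", "")),
    }


def true_positive_rate(responses: list[str], threshold: float = 0.5) -> float:
    """Share of known-WWTP images that the model flags as WWTP.

    Every entry in `responses` must come from an image that truly shows a WWTP.
    A reply counts as a hit only if it says WWTP with confidence >= threshold
    (the threshold value is an assumption).
    """
    if not responses:
        raise ValueError("need at least one response")
    parsed = (parse_vlm_response(r) for r in responses)
    hits = sum(1 for p in parsed if p["is_wwtp"] and p["confidence"] >= threshold)
    return hits / len(responses)


if __name__ == "__main__":
    sample = [
        '{"is_wwtp": true, "confidence": 0.9, "description": "two circular clarifiers"}',
        '{"is_wwtp": false, "confidence": 0.8, "description": "irrigation ponds"}',
    ]
    print(true_positive_rate(sample))
```

Counting malformed replies as misses keeps the rate conservative; a study could equally report them as a separate failure category.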