Accurate road damage detection is crucial for timely infrastructure maintenance and public safety, but existing vision-only datasets and models lack the rich contextual understanding that textual information can provide. To address this limitation, we introduce RoadBench, the first multimodal benchmark for comprehensive road damage understanding. The dataset pairs high-resolution images of road damage with detailed textual descriptions, providing richer context for model training. We also present RoadCLIP, a novel vision-language model that builds upon CLIP with domain-specific enhancements: a disease-aware positional encoding that captures the spatial patterns of road defects, and a mechanism for injecting road-condition priors that refines the model's understanding of road damage. We further employ a GPT-driven data generation pipeline to expand the image-text pairs in RoadBench, greatly increasing data diversity without exhaustive manual annotation. Experiments demonstrate that RoadCLIP achieves state-of-the-art performance on road damage recognition, outperforming existing vision-only models by 19.2% in accuracy. These results highlight the benefits of combining visual and textual information for road condition analysis, setting a new benchmark for the field and paving the way for more effective infrastructure monitoring through multimodal learning.
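To make the architectural idea concrete, the following is a minimal PyTorch sketch of how a disease-aware positional encoding with injected road-condition priors might be structured. The abstract does not specify the actual formulation, so every name here (DiseaseAwarePositionalEncoding, damage_probs, the per-type spatial bias table) is a hypothetical illustration under our own assumptions, not the paper's implementation.

```python
# Hypothetical sketch: a "disease-aware" positional encoding layered on top of
# standard ViT patch position embeddings. We assume a learned, per-damage-type
# spatial bias that is mixed according to an injected road-condition prior.
import torch
import torch.nn as nn

class DiseaseAwarePositionalEncoding(nn.Module):
    def __init__(self, num_patches: int, embed_dim: int, num_damage_types: int = 8):
        super().__init__()
        # Standard learnable positional embedding, as in CLIP's ViT image encoder.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # One spatial bias table per damage type (crack, pothole, rutting, ...).
        self.damage_bias = nn.Parameter(torch.zeros(num_damage_types, num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.damage_bias, std=0.02)

    def forward(self, patch_tokens: torch.Tensor, damage_probs: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, num_patches, embed_dim)
        # damage_probs: (B, num_damage_types), a soft road-condition prior,
        # e.g. from a lightweight classifier head or external metadata.
        # Mix the per-type spatial biases by the prior weights.
        bias = torch.einsum("bt,tpd->bpd", damage_probs, self.damage_bias)
        return patch_tokens + self.pos_embed + bias
```

In this sketch, damage_probs plays the role of the injected road-condition prior: it steers which learned spatial bias dominates, reflecting that different defect types (e.g. longitudinal cracks vs. potholes) exhibit different spatial patterns.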
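Similarly, a GPT-driven caption-expansion loop could look like the sketch below. This is not the paper's pipeline; it assumes the OpenAI Python client and a vision-capable chat model, with a prompt we invented for illustration.

```python
# Hypothetical sketch of GPT-driven image-to-text pair generation:
# ask a vision-language model for a detailed road-damage description
# to pair with each image, avoiding exhaustive manual annotation.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_road_damage(image_path: str) -> str:
    """Generate a detailed textual description for one road damage image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe the road damage in this image: damage type, "
                          "severity, extent, and likely cause, in 2-3 sentences.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```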