Subnational location data for disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations as unstructured text with inconsistent granularity and spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated, LLM-assisted workflow that uses GPT-4o to process and clean textual location information and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap, and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to EM-DAT records from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
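To illustrate how a reliability score could be derived from the agreement and availability of the three repositories, the following is a minimal sketch and not the scoring rule used in this work. It assumes each source returns either nothing or a bounding box, measures agreement by intersection-over-union (IoU), and maps the result to a hypothetical 0 to 3 scale. The function names, the `min_iou` threshold, and the scale itself are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

# Assumed geometry stand-in: (min_lon, min_lat, max_lon, max_lat).
BBox = Tuple[float, float, float, float]


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def reliability_score(
    candidates: Dict[str, Optional[BBox]], min_iou: float = 0.5
) -> int:
    """Hypothetical score (an assumption, not the published rule):
    0 = no source returned a geometry,
    1 = geometries exist but no two sources agree,
    2 = at least one pair of sources agrees,
    3 = all three sources are available and mutually agree.
    """
    available = {k: v for k, v in candidates.items() if v is not None}
    if not available:
        return 0
    pairs = list(combinations(available.values(), 2))
    agreeing = sum(1 for a, b in pairs if bbox_iou(a, b) >= min_iou)
    if agreeing == 0:
        return 1
    if len(available) == 3 and agreeing == len(pairs):
        return 3
    return 2


if __name__ == "__main__":
    example = {
        "GADM": (0.0, 0.0, 10.0, 10.0),
        "OpenStreetMap": (0.0, 0.0, 10.0, 9.0),
        "Wikidata": None,  # source returned no match
    }
    print(reliability_score(example))
```

In practice, agreement would more plausibly be computed on full polygons with a geospatial library, and the threshold would need calibration against manually verified locations.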