Subnational Geocoding of Global Disasters Using Large Language Models

Subnational location data on disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations as unstructured text with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
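To make the idea of scoring a location by source agreement and availability concrete, here is a minimal sketch. It is not the scoring scheme used in the work itself, which the abstract does not spell out. It assumes each repository returns either nothing or a bounding box, that two sources "agree" when their boxes overlap with an intersection-over-union of at least 0.5, and that the score is the size of the largest group of mutually agreeing sources. The names `bbox_iou`, `reliability_score` and the `min_iou` threshold are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

# Assumed geometry stand-in: (min_lon, min_lat, max_lon, max_lat).
BBox = Tuple[float, float, float, float]


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def reliability_score(candidates: Dict[str, Optional[BBox]],
                      min_iou: float = 0.5) -> int:
    """Hypothetical score: 0 if no source has a match, otherwise the size
    of the largest group of sources whose boxes pairwise agree."""
    found = {name: box for name, box in candidates.items() if box is not None}
    if not found:
        return 0
    names = list(found)
    # Try the largest groups first; a single source always scores 1.
    for size in range(len(names), 1, -1):
        for group in combinations(names, size):
            if all(bbox_iou(found[x], found[y]) >= min_iou
                   for x, y in combinations(group, 2)):
                return size
    return 1


if __name__ == "__main__":
    example = {
        "GADM": (0.0, 0.0, 10.0, 10.0),
        "OpenStreetMap": (0.0, 0.0, 10.0, 9.0),
        "Wikidata": None,  # no match found in this source
    }
    print(reliability_score(example))
```

A real pipeline would compare polygon geometries rather than bounding boxes, and could weight sources differently. The point of the sketch is only that both availability (how many sources return a match) and agreement (how well the matches overlap) feed into the score.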

