Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance along three complementary dimensions: dominant-order accuracy, order-coverage completeness, and distributional fidelity, which together reflect how well the system generalizes, identifies, and quantifies word-order variation. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.
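To make the target task concrete, below is a minimal sketch of the kind of analysis the framework is meant to generate and run automatically: counting subject, verb, and object configurations in a UD treebank to estimate a language's word-order distribution. This is an assumption-laden illustration, not the paper's implementation; the file name `corpus.conllu` is a placeholder, and restricting clause detection to `VERB` heads with `nsubj` and `obj` dependents is one simple convention among several possible ones.

```python
from collections import Counter

def read_conllu(path):
    """Yield sentences from a CoNLL-U file as lists of token dicts."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                  # blank line ends a sentence
                if sent:
                    yield sent
                    sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                # Skip multiword-token and empty-node lines (non-integer IDs).
                if cols[0].isdigit() and cols[6].isdigit():
                    sent.append({"id": int(cols[0]), "upos": cols[3],
                                 "head": int(cols[6]), "deprel": cols[7]})
    if sent:
        yield sent

def clause_order(sent):
    """Classify a sentence as 'SVO', 'SOV', ... from its first transitive verb."""
    for verb in (t for t in sent if t["upos"] == "VERB"):
        deps = {t["deprel"]: t for t in sent if t["head"] == verb["id"]}
        if "nsubj" in deps and "obj" in deps:
            roles = [("S", deps["nsubj"]["id"]),
                     ("V", verb["id"]),
                     ("O", deps["obj"]["id"])]
            return "".join(r for r, _ in sorted(roles, key=lambda p: p[1]))
    return None

counts = Counter()
for sent in read_conllu("corpus.conllu"):   # placeholder path
    order = clause_order(sent)
    if order:
        counts[order] += 1

total = sum(counts.values())
print("dominant order:", counts.most_common(1)[0][0] if counts else "n/a")
print("attested orders:", sorted(counts))
for order, n in counts.most_common():
    print(f"  {order}: {n} ({n / total:.1%})")
```

The three printed quantities correspond to the inputs of the three evaluation dimensions: the most frequent order (dominant-order accuracy), the set of attested orders (order-coverage completeness), and the full frequency distribution (distributional fidelity).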