This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.