This study examines the digital representation of African languages and the challenges it presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and unrepresentative of the authentic, monolingual conversations common among native speakers. This scarcity of authentic online data poses a challenge for training conversational language models. To investigate this, we collected data from subreddits and local news sources for each language. The analysis revealed a stark contrast between the two sources: Reddit data was minimal and characterized by heavy code-switching, whereas local news media offered a robust source of clean, monolingual text and also prompted greater user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general-purpose LLM, achieved near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source than conversational platforms for training context-rich AI models for African languages. It also highlights the need for future models that can handle both clean and code-switched text to improve detection accuracy for African languages.