Despite recent advances in attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in low-resource settings due to the lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language -- Sumerian cuneiform -- one of the world's oldest written languages, attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We further curate InterpretLR, an interpretability toolkit for low-resource NLP, and use it alongside human attributions to make sense of the models. We emphasize human evaluation to gauge all our techniques. Notably, most components of our pipeline can be generalized to any other language to obtain an interpretable execution of the techniques, especially in a low-resource setting. We publicly release all software, model checkpoints, and a novel dataset with domain-specific pre-processing to promote further research.