This paper investigates how morphologically informed tokenizers can aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM). The annotation workflow combines ASR with text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of the human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving as much information about tonal morphology as possible. The first, a Segment-and-Melody tokenizer, simply extracts the tones without predicting segmentation. The second, a Sequence-of-Processes tokenizer, predicts segmentation for each word, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models: the Segment-and-Melody model outperforms the traditional tokenizers in word error rate, although it does not match their character error rate. In addition, we analyze the tokenizers on morphological and information-theoretic metrics to identify correlations that are predictive of downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research is needed to determine the applicability of these tokenizers in downstream processing tasks.
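The abstract reports that a tokenizer can lead on word error rate (WER) while trailing on character error rate (CER). The sketch below illustrates how the two metrics can rank systems differently, using their standard edit-distance definitions. It is not the paper's evaluation code: the function names are hypothetical, and the strings are toy English placeholders rather than YM data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Toy placeholder strings (assumption: illustrative only, not corpus data).
reference = "the cat sat down"
hyp_a = "the cat sat xyzw"   # one word entirely wrong
hyp_b = "tha cat sat dawn"   # two words, each off by one character

print(wer(reference, hyp_a), cer(reference, hyp_a))  # 0.25 0.25
print(wer(reference, hyp_b), cer(reference, hyp_b))  # 0.5 0.125
```

Here hypothesis A has the lower WER and hypothesis B the lower CER. Concentrating errors in a few words favors WER, while spreading small errors across many words favors CER.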