The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers, from pre-training through fine-tuning, using small BERT models for Telugu (agglutinative), along with preliminary evaluations on Hindi (primarily fusional, with some agglutination) and English (fusional). To evaluate the morphological alignment of tokenizers in Telugu, we create a dataset containing gold-standard morpheme segmentations of 600 derivational and 7,000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing downstream performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment correlates moderately and positively with performance on text classification and structure prediction tasks, its impact is secondary to that of the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not of Unigram. Our results further underscore the need for comprehensive intrinsic evaluation metrics for tokenizers that can consistently explain downstream performance trends.
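To make the two algorithmic variants and the alignment evaluation concrete, the following is a minimal sketch, not the paper's actual pipeline: it trains BPE and Unigram tokenizers with the sentencepiece library and scores a tokenizer's segmentation of a word against a gold morpheme segmentation using a boundary-level F1. The corpus path, vocabulary size, and the choice of boundary F1 as the alignment metric are illustrative assumptions; the paper's own metric may differ.

```python
# Hypothetical sketch: train BPE vs. Unigram subword tokenizers with
# sentencepiece, then measure how well a tokenizer's segment boundaries
# line up with gold morpheme boundaries (one plausible alignment metric).
import sentencepiece as spm

def train_tokenizer(corpus_path: str, prefix: str, model_type: str,
                    vocab_size: int = 32000) -> spm.SentencePieceProcessor:
    """Train a subword tokenizer; model_type is 'bpe' or 'unigram'."""
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=prefix,
        model_type=model_type,
        vocab_size=vocab_size,           # illustrative size, not the paper's
        character_coverage=1.0,          # keep full coverage of the Telugu script
    )
    return spm.SentencePieceProcessor(model_file=f"{prefix}.model")

def boundary_positions(segments):
    """Character offsets of internal boundaries, e.g. ['ab', 'cd'] -> {2}."""
    positions, offset = set(), 0
    for seg in segments[:-1]:
        offset += len(seg)
        positions.add(offset)
    return positions

def boundary_f1(predicted_segments, gold_segments):
    """Boundary-level F1 between a tokenizer split and a gold morpheme split."""
    pred = boundary_positions(predicted_segments)
    gold = boundary_positions(gold_segments)
    if not pred or not gold:             # single-segment edge cases
        return 1.0 if pred == gold else 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical usage (paths and gold data are placeholders):
# sp_uni = train_tokenizer("telugu_corpus.txt", "te_unigram", "unigram")
# sp_bpe = train_tokenizer("telugu_corpus.txt", "te_bpe", "bpe")
# pieces = sp_uni.encode(word, out_type=str)   # strip leading '▁' before scoring
# score = boundary_f1(pieces, gold_morphemes)  # gold_morphemes: list of morphs
```

Boundary F1 compares only where the cuts fall, so it rewards a tokenizer whose splits coincide with morpheme boundaries regardless of vocabulary overlap, which is one common way morphological alignment is operationalized; averaging it over the gold-segmented word forms yields a per-tokenizer alignment score.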