The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, the algorithm is complex to implement in practice, which has limited its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.
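To make the contrast with BPE's greedy merges concrete, the decoding step of a unigram model can be sketched as a Viterbi search for the segmentation that maximizes the sum of per-token log-probabilities. The toy vocabulary and function below are illustrative assumptions, not the paper's actual implementation:

```python
import math

def viterbi_segment(text, log_probs):
    """Return the tokenization of `text` maximizing the sum of
    unigram log-probabilities, via dynamic programming."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end to recover the winning segmentation.
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]

# Hypothetical vocabulary: longer pieces carry higher probability,
# so the search prefers them over character-by-character splits.
vocab = {"h": -5.0, "e": -5.0, "l": -5.0, "o": -5.0,
         "he": -3.0, "ll": -3.0, "lo": -3.0,
         "hell": -2.5, "hello": -2.0}
print(viterbi_segment("hello", vocab))  # ['hello']
```

Unlike BPE, which commits to one merge at a time, this search considers every possible split and is guaranteed to return the globally optimal segmentation under the model.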