Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but they involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade rule-based system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.
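For readers unfamiliar with the setup, the sketch below illustrates the two ingredients the abstract names: few-shot prompting an LLM to verbalize written text, and scoring the output against a reference with word error rate. The prompt wording, the few-shot example, and the injected `llm` callable are illustrative assumptions, not PolyNorm's actual prompt or evaluation pipeline.

```python
"""Minimal sketch of prompt-based text normalization and WER scoring.

Assumptions: the prompt template, few-shot pair, and `llm` callable are
placeholders for illustration; they do not reproduce PolyNorm's prompts.
"""
from typing import Callable

# One illustrative written/spoken pair used as a few-shot demonstration.
FEW_SHOT = [
    ("Dr. Smith paid $3.50 on 12/25/2023.",
     "Doctor Smith paid three dollars fifty cents on "
     "December twenty-fifth twenty twenty-three."),
]

def build_prompt(written: str) -> str:
    """Assemble a few-shot prompt asking the model to verbalize written text."""
    lines = ["Convert the written text to its spoken form."]
    for src, tgt in FEW_SHOT:
        lines.append(f"Written: {src}\nSpoken: {tgt}")
    lines.append(f"Written: {written}\nSpoken:")
    return "\n\n".join(lines)

def normalize(written: str, llm: Callable[[str], str]) -> str:
    """Normalize one sentence with any text-in/text-out LLM completion function."""
    return llm(build_prompt(written)).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, computed as Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "three dollars fifty cents"
    hyp = "three dollar fifty cents"
    print(f"WER = {wer(ref, hyp):.2f}")  # 1 substitution / 4 words = 0.25
```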