India is home to many languages, and training automatic speech recognition (ASR) systems for them is challenging. Over time, each language has adopted words from other languages, such as English, leading to code-mixing. Most Indian languages also have their own unique scripts, which poses a major obstacle to training multilingual and code-switching ASR systems. Inspired by results in text-to-speech synthesis, we use an in-house rule-based phoneme-level common label set (CLS) representation to train multilingual and code-switching ASR for Indian languages. We propose two end-to-end (E2E) ASR systems. In the first system, the E2E model is trained on the CLS representation, and a novel data-driven back-end recovers the native language script. In the second system, we modify the E2E model so that the CLS representation and the native language characters are used simultaneously for training. We report results on the multilingual and code-switching tasks of the Indic ASR Challenge 2021. On the challenge development data, our best systems improve word error rate over the baseline by approximately 6% on the multilingual task and approximately 5% on the code-switching task.
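To illustrate the idea of a common label set, the following is a minimal sketch in which characters from two scripts are mapped to shared phoneme-like labels and then mapped back to a chosen script. The tables, label names, and the functions `to_cls` and `from_cls` are hypothetical and are not the in-house CLS described above. The reverse step here is a plain table lookup, whereas the first proposed system uses a data-driven back-end. The sketch also ignores inherent vowels and other script-specific rules.

```python
# Toy character-to-label tables for two scripts.
# Hypothetical illustration only; not the paper's in-house CLS.
TO_CLS = {
    "hindi": {"म": "m", "ल": "l", "क": "k", "ा": "aa"},
    "tamil": {"ம": "m", "ல": "l", "க": "k", "ா": "aa"},
}


def to_cls(text, lang):
    """Map each character of `text` to its shared label for `lang`."""
    table = TO_CLS[lang]
    return [table[ch] for ch in text]


def from_cls(labels, lang):
    """Recover native-script text from shared labels by table lookup.

    Assumption: a one-to-one table. The paper's back-end is data-driven.
    """
    inverse = {label: ch for ch, label in TO_CLS[lang].items()}
    return "".join(inverse[label] for label in labels)


# The same sound sequence written in two scripts yields one label sequence.
hindi_labels = to_cls("माला", "hindi")
tamil_labels = to_cls("மாலா", "tamil")
print(hindi_labels == tamil_labels)

# Labels can be rendered in either script.
print(from_cls(hindi_labels, "tamil"))
```

Because both scripts share one label inventory in this sketch, a single model could in principle be trained on data from either language, which is the motivation the abstract gives for the CLS representation.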