Neural-network-based speech recognition systems suffer performance degradation on accented speech, especially unfamiliar accents. In this paper, we study a supervised contrastive learning framework for accented speech recognition. To build different views (similar "positive" data samples) for contrastive learning, we further investigate three data augmentation techniques: noise injection, spectrogram augmentation, and TTS-same-sentence generation. Experiments on the Common Voice dataset show that contrastive learning helps build augmentation-invariant and pronunciation-invariant representations, which significantly outperform traditional joint-training methods in both zero-shot and full-shot settings. On average, contrastive learning improves accuracy by 3.66% (zero-shot) and 3.78% (full-shot) compared with the joint-training method.
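The supervised contrastive objective described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the standard supervised contrastive (SupCon-style) loss, where augmented views of the same utterance share a label and act as positives for one another, and uses NumPy with an illustrative temperature `tau`:

```python
import numpy as np

def _logsumexp(x, axis):
    # numerically stable log-sum-exp; -inf entries contribute zero
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over a batch of embeddings.

    embeddings: (N, D) array; L2-normalised inside.
    labels:     (N,) array; samples with the same label are positives
                (e.g. different augmented views of one utterance).
    tau:        temperature (illustrative value, not from the paper).
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau                          # pairwise scaled similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # exclude the anchor itself from the softmax denominator
    sim_masked = np.where(self_mask, -np.inf, sim)
    log_prob = sim_masked - _logsumexp(sim_masked, axis=1)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # negative mean log-probability over each anchor's positives
    per_anchor = -np.where(pos, log_prob, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return per_anchor.mean()
```

Pulling augmented views of the same utterance together (and pushing other utterances apart) is what encourages the augmentation- and pronunciation-invariant representations the abstract reports.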