Transformer-based models have driven significant innovation across a variety of classic and practical domains, including speech processing, natural language processing, and computer vision. Building on the transformer, attention-based end-to-end automatic speech recognition (ASR) models have become increasingly popular in recent years. In particular, non-autoregressive modeling, which achieves fast inference while delivering performance comparable to conventional autoregressive methods, has emerged as an active research topic. In natural language processing, the bidirectional encoder representations from transformers (BERT) model has received widespread attention, in part because it infers contextualized word representations and attains superior performance on downstream tasks with only simple fine-tuning. To inherit the advantages of non-autoregressive ASR modeling while also benefiting from a pre-trained language model (e.g., BERT), this paper presents a non-autoregressive transformer-based end-to-end ASR model built on BERT. A series of experiments conducted on the AISHELL-1 dataset demonstrates that the proposed model achieves competitive or superior results compared with state-of-the-art ASR systems.
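To make the non-autoregressive idea concrete, the following is a minimal PyTorch sketch, not the paper's actual architecture: a transformer acoustic encoder paired with a decoder that fills a sequence of mask embeddings in a single parallel pass, so every output character is predicted simultaneously rather than one token at a time. All module names, dimensions, the vocabulary size, and the way the output length is supplied are illustrative assumptions.

```python
# Minimal sketch of non-autoregressive ASR decoding (illustrative only):
# the decoder receives one learned [MASK] embedding per output position and
# predicts all characters in parallel by cross-attending to the acoustic
# encoder states. Hyperparameters and the length-handling are assumptions.
import torch
import torch.nn as nn


class NonAutoregressiveASR(nn.Module):
    def __init__(self, vocab_size=4000, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        # Acoustic encoder over frame-level speech features (e.g. filterbanks).
        self.feat_proj = nn.Linear(80, d_model)  # assume 80-dim input features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)

        # Decoder input: a learned mask embedding repeated L times, where L is
        # the target length (here supplied directly for simplicity).
        self.mask_embed = nn.Parameter(torch.randn(1, 1, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, feats, target_len):
        # feats: (batch, frames, 80); target_len: number of output characters.
        memory = self.encoder(self.feat_proj(feats))
        queries = self.mask_embed.expand(feats.size(0), target_len, -1)
        # One parallel pass over all positions: this is what makes inference
        # non-autoregressive, and hence fast compared with left-to-right decoding.
        hidden = self.decoder(queries, memory)
        return self.out_proj(hidden)  # (batch, target_len, vocab_size)


if __name__ == "__main__":
    model = NonAutoregressiveASR()
    logits = model(torch.randn(2, 300, 80), target_len=20)
    print(logits.shape)  # torch.Size([2, 20, 4000])
```

In a BERT-based variant of this idea, the decoder stack would be initialized from a pre-trained language model so that the parallel predictions benefit from its contextualized representations; the sketch above omits that detail.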