We present Maestro, a self-supervised training method to unify representations learned from the speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work has either implicitly aligned the representations learned from these two modalities in the latent space, through multitask training and parameter sharing, or aligned them explicitly by converting between modalities via speech synthesis. The former suffers from interference between the two modalities, while the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm that learns unified representations from both modalities simultaneously and transfers to diverse downstream tasks such as Automatic Speech Recognition (ASR) and Speech Translation (ST). Maestro learns unified representations through sequence alignment, duration prediction, and matching of embeddings in the learned space via an aligned masked-language-model loss. We establish a new state of the art (SOTA) on VoxPopuli multilingual ASR with an 8% relative reduction in Word Error Rate (WER), on multidomain SpeechStew ASR (3.7% relative), and on multilingual ST from 21 languages into English on CoVoST 2, with an improvement of 2.8 BLEU averaged over the 21 language pairs.
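To make the alignment idea concrete, the sketch below shows one common way duration prediction can bridge the two modalities: each text-token embedding is repeated for its predicted number of speech frames so the two sequences share a time axis, after which an embedding-matching loss can be computed frame by frame. This is a minimal NumPy illustration under assumed shapes and an assumed mean-squared matching objective, not the paper's actual implementation; the function names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample_by_duration(text_emb, durations):
    # Repeat each token embedding `durations[i]` times along the time axis,
    # so the text sequence has the same length as the speech frame sequence.
    return np.repeat(text_emb, durations, axis=0)

def matching_loss(speech_emb, text_emb, durations):
    # Frame-level mean-squared distance between speech embeddings and the
    # duration-aligned text embeddings (an assumed stand-in for the paper's
    # aligned embedding-matching objective).
    aligned = upsample_by_duration(text_emb, durations)
    assert aligned.shape == speech_emb.shape, "durations must sum to #frames"
    return float(np.mean((speech_emb - aligned) ** 2))

# Toy example: 3 tokens whose predicted durations sum to 6 speech frames.
text = rng.normal(size=(3, 4))          # (tokens, embedding dim)
durations = np.array([2, 1, 3])         # predicted frames per token
speech = upsample_by_duration(text, durations) + 0.1 * rng.normal(size=(6, 4))
loss = matching_loss(speech, text, durations)
```

Minimizing such a loss pulls the speech encoder's frame embeddings toward the shared text-embedding space, which is what allows unpaired text to improve downstream ASR and ST.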