A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the scarcity of audio-visual data, both labeled and unlabeled, and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for speech recognition and speaker verification. In particular, our single model yields word error rates of 1.2%, 1.4%, and 27.2% on LRS3 speech recognition with audio-visual, audio-only, and visual-only input, respectively.
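As a rough illustration of the modality dropout idea mentioned above, the following minimal sketch randomly blanks out one input stream of an audio-visual utterance so that a single model also sees audio-only and video-only inputs during training. The abstract does not specify how a dropped modality is represented or how the streams are fused, so the zero-filling, the frame-wise concatenation in `fuse`, and the names `modality_dropout`, `p_drop`, and `p_keep_audio` are all assumptions made for this example, not the authors' implementation.

```python
import random


def modality_dropout(audio, video, p_drop=0.5, p_keep_audio=0.5, rng=random):
    """Randomly zero out one modality of an audio-visual utterance.

    audio, video: lists of per-frame feature vectors (lists of floats),
    assumed to be time-aligned and of equal length.
    p_drop: probability of dropping one modality (hypothetical parameter).
    p_keep_audio: given a drop, probability that audio is the stream kept.
    """
    if rng.random() < p_drop:
        if rng.random() < p_keep_audio:
            # Keep audio and replace video features with zeros (assumed filler).
            video = [[0.0] * len(frame) for frame in video]
        else:
            # Keep video and replace audio features with zeros (assumed filler).
            audio = [[0.0] * len(frame) for frame in audio]
    return audio, video


def fuse(audio, video):
    """Concatenate audio and video features frame by frame (assumed fusion)."""
    return [a + v for a, v in zip(audio, video)]


# Example: a 3-frame utterance with 2-dim audio and 1-dim video features.
audio_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
video_feats = [[1.0], [2.0], [3.0]]
a, v = modality_dropout(audio_feats, video_feats, rng=random.Random(0))
fused = fuse(a, v)
```

Because the fused input always has the same shape regardless of which stream was dropped, the same encoder can consume audio-visual, audio-only, and visual-only inputs, which is the property the abstract relies on for zero-shot modality transfer.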
