Visual Speech Recognition for Multiple Languages in the Wild

Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than model design. Here we demonstrate that designing better models is as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. We show, furthermore, that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.
