Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end speech models back into the underfitting regime and expose biases in the model that we show cannot be overcome by "scaling up", i.e., training bigger models on more data. In this work we systematically identify and address sources of bias, reducing error rates by up to 20% while remaining practical for deployment. We achieve this by utilizing improved neural architectures for streaming inference, solving optimization issues, and employing strategies that increase audio and label modelling versatility.