We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to implicitly encode absolute frame positions by exploiting the limited acoustic context beyond segment boundaries, but fail to generalize when decoding segments of long-form encodings, where these cues vanish. The model loses the ability to order acoustic encodings due to the permutation invariance of key/value pairs in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover the diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show that these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
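The permutation-invariance claim admits a one-line formal statement. Assuming the standard scaled dot-product formulation, for a decoder query $q$ and encoder key/value pairs $\{(k_i, v_i)\}_{i=1}^{T}$, the cross-attention output is unchanged under any permutation $\pi$ of the pairs, so ordering information must be carried by the encodings themselves:

$$\operatorname{Attn}(q, K, V) \;=\; \sum_{i=1}^{T} \frac{\exp\!\big(q^\top k_i/\sqrt{d}\big)}{\sum_{j=1}^{T}\exp\!\big(q^\top k_j/\sqrt{d}\big)}\, v_i \;=\; \sum_{i=1}^{T} \frac{\exp\!\big(q^\top k_{\pi(i)}/\sqrt{d}\big)}{\sum_{j=1}^{T}\exp\!\big(q^\top k_{\pi(j)}/\sqrt{d}\big)}\, v_{\pi(i)},$$

since both sums run over the same set of pairs.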
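To make modification (1) concrete, the following is a minimal sketch of injecting explicit positional encodings into cross-attention for one decoded segment of a continuous long-form encoding. All names (`sinusoidal_pe`, `cross_attend_segment`, `seg_start`, `seg_end`) and design choices (sinusoidal encodings, additive injection, positions restarting at each segment boundary, single-head attention) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of modification (1); names and design choices are
# illustrative assumptions, not the paper's implementation.
import torch

def sinusoidal_pe(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal positional encodings for the given frame positions."""
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions.float().unsqueeze(-1) * inv_freq        # (T, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)     # (T, dim)

def cross_attend_segment(q, enc, seg_start, seg_end, w_k, w_v):
    """Single-head cross-attention over one decoded segment.

    q:                  (num_queries, dim) decoder states
    enc:                (T_long, dim) continuous long-form encoder output
    seg_start, seg_end: frame span of the segment currently being decoded
    w_k, w_v:           (dim, dim) key/value projections
    """
    seg = enc[seg_start:seg_end]                               # (T_seg, dim)
    # Explicit positions restart at each segment boundary, mimicking the
    # positions the model saw during segmented training; recording-level
    # absolute positions (seg_start..seg_end) would be the alternative.
    pos = torch.arange(seg_end - seg_start)
    seg = seg + sinusoidal_pe(pos, seg.shape[-1])              # inject before K/V
    k, v = seg @ w_k, seg @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```

Restarting positions per decoded segment is the choice consistent with the rest of the abstract: modification (4) aligns AED-decoded segments with training segments, so per-segment positions seen at decoding time match those seen in training.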