Traditional ASR metrics such as WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors such as phoneme repetitions and imprecise consonants even when the meaning remains clear to human listeners. We identify two key challenges: (1) existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address these challenges, we propose a novel metric that integrates Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and underscoring the need to prioritize intelligibility over error-based measures.
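To make the composition concrete, here is a minimal illustrative sketch of how such a composite score could be assembled. This is not the paper's implementation: the model choices (`all-MiniLM-L6-v2`, `roberta-large-mnli`), the character-level stand-in for phonetic similarity, and the weights are all assumptions made for illustration.

```python
# Illustrative sketch only: a composite intelligibility-oriented ASR score
# blending NLI entailment, semantic similarity, and a phonetic proxy.
# Model names, the phonetic stand-in, and the weights are assumptions,
# not the paper's specification.
from difflib import SequenceMatcher

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Off-the-shelf models chosen for illustration.
semantic_model = SentenceTransformer("all-MiniLM-L6-v2")
nli_model = pipeline("text-classification", model="roberta-large-mnli")


def phonetic_similarity(ref: str, hyp: str) -> float:
    """Character-level similarity as a crude stand-in for a phoneme-level
    comparison (a real system would align phoneme sequences, e.g. via G2P)."""
    return SequenceMatcher(None, ref.lower(), hyp.lower()).ratio()


def composite_score(reference: str, hypothesis: str,
                    w_nli: float = 0.4, w_sem: float = 0.4,
                    w_phon: float = 0.2) -> float:
    """Weighted blend of the three signals; the weights are hypothetical."""
    # NLI: probability that the hypothesis is entailed by the reference.
    nli_out = nli_model({"text": reference, "text_pair": hypothesis}, top_k=None)
    entail = next(d["score"] for d in nli_out if d["label"] == "ENTAILMENT")

    # Semantic similarity: cosine similarity of sentence embeddings.
    emb = semantic_model.encode([reference, hypothesis])
    sem = float(util.cos_sim(emb[0], emb[1]))

    return w_nli * entail + w_sem * sem + w_phon * phonetic_similarity(reference, hypothesis)


# Example: a transcript with a repetition typical of dysarthric speech
# should still score high on meaning-oriented components.
print(composite_score("I would like a glass of water",
                      "I would like a glass of water water"))
```

In practice, the phonetic term would compare phoneme sequences (e.g., after grapheme-to-phoneme conversion), and the component weights would be fit against human intelligibility ratings rather than fixed by hand.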