Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains fragmented across tasks and model types: different models excel at distinct aspects of speech processing and therefore require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the methodological requirements of the evaluation protocol itself. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as the limited coverage of prosody, interaction, and reasoning, which highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and a practical guide for selecting, interpreting, and extending evaluations of speech models.
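As a purely illustrative sketch (not code or terminology from the paper), the three axes can be viewed as fields of a record that classifies each evaluation; every name below (`Aspect`, `Capability`, `Requirement`, `applicable`) is a hypothetical label chosen for this example, and the enum members simply echo the examples mentioned above.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical labels for the three axes; the concrete values are illustrative.
class Aspect(Enum):          # what the evaluation measures
    REPRESENTATION = auto()
    GENERATION_QUALITY = auto()
    DIALOGUE_INTERACTION = auto()

class Capability(Enum):      # what the model must expose to attempt the task
    SPEECH_GENERATION = auto()
    REAL_TIME_PROCESSING = auto()
    TEXT_OUTPUT = auto()

class Requirement(Enum):     # what the protocol demands from the evaluator
    FINE_TUNING_DATA = auto()
    HUMAN_JUDGMENT = auto()
    AUTOMATIC_METRIC_ONLY = auto()

@dataclass
class Evaluation:
    name: str
    aspect: Aspect
    capabilities: set[Capability]
    requirements: set[Requirement]

# Example classification of a hypothetical TTS naturalness benchmark.
tts_mos = Evaluation(
    name="TTS naturalness (MOS)",
    aspect=Aspect.GENERATION_QUALITY,
    capabilities={Capability.SPEECH_GENERATION},
    requirements={Requirement.HUMAN_JUDGMENT},
)

def applicable(evaluation: Evaluation, model_capabilities: set[Capability]) -> bool:
    """A model can attempt an evaluation only if it exposes every required capability."""
    return evaluation.capabilities <= model_capabilities
```

Under this reading, matching models to evaluations reduces to a capability check along one axis, while the requirements axis signals what resources (e.g., annotators or fine-tuning data) an evaluator must budget for.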