Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models covering a wide range of speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting significant room for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10), compared to 72.7% for a classic Bradley-Terry reward model. Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
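For illustration, the sketch below shows one plausible way the "inference-time scaling @10" evaluation could be realized: sample the generative reward model's pairwise verdict several times and take a majority vote, then score agreement against human preference labels. This is a minimal sketch under stated assumptions; the `grm_judge` callable, the voting rule, and the data layout are placeholders rather than the paper's released interface.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical judge: given the reference text and two synthesized audio
# clips, returns "A" or "B" for the clip it prefers on naturalness.
# One call corresponds to one stochastic GRM generation (an assumption).
Judge = Callable[[str, bytes, bytes], str]


def judge_pair_with_voting(grm_judge: Judge, text: str,
                           audio_a: bytes, audio_b: bytes,
                           n_samples: int = 10) -> str:
    """Sample the GRM verdict n_samples times and return the majority vote."""
    votes = Counter(grm_judge(text, audio_a, audio_b) for _ in range(n_samples))
    # most_common(1) returns the verdict with the highest vote count.
    return votes.most_common(1)[0][0]


def pairwise_accuracy(grm_judge: Judge,
                      labeled_pairs: List[Tuple[str, bytes, bytes, str]],
                      n_samples: int = 10) -> float:
    """Fraction of pairs where the voted verdict matches the human label.

    Each item in labeled_pairs is (text, audio_a, audio_b, human_label),
    with human_label in {"A", "B"}; this layout is assumed for the sketch.
    """
    correct = sum(
        judge_pair_with_voting(grm_judge, text, a, b, n_samples) == label
        for text, a, b, label in labeled_pairs
    )
    return correct / max(len(labeled_pairs), 1)
```

Under this reading, accuracy @1 would use `n_samples=1` and accuracy @10 would use `n_samples=10`, with the vote aggregating away some of the variance in individual GRM judgments.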