Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical. However, current methods such as reward systems often rely on a single numerical score, struggle to capture dimensions such as phrasing or expressiveness, and require costly annotations, which limits their interpretability and generalization. To address these issues, we propose a generative feedback (i.e., reward model) framework that provides multi-dimensional language and audio feedback for SVS assessment. Our approach leverages an audio-language model to generate text and audio critiques covering aspects such as melody, content, and auditory quality. The model is fine-tuned on a hybrid dataset that combines human music reactions with synthetic critiques from multimodal large language models (MLLMs), enhancing diversity and linguistic richness. Quantitative experiments validate the effectiveness of the proposed dataset and training strategy, demonstrating that the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement. The code is available at [https://github.com/opendilab/VocalCritic](https://github.com/opendilab/VocalCritic).
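As a concrete illustration of the multi-dimensional feedback format described above, the following Python sketch shows how such a generative critic might be queried and how its output could be structured. The dimension names, prompt wording, and `DimensionalCritique` data structure are illustrative assumptions for exposition, not the repository's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of multi-dimensional generative feedback.
# The critic is assumed to be an audio-language model that receives
# an (audio clip, text prompt) pair and replies with free-form critiques.

DIMENSIONS = ("melody", "content", "auditory_quality")

@dataclass
class DimensionalCritique:
    dimension: str   # e.g. "melody"
    score: float     # optional scalar summary in [0, 1]
    comment: str     # free-form language feedback

def build_prompt(lyrics: str) -> str:
    """Compose the text side of the (audio, text) query for the critic."""
    dims = ", ".join(DIMENSIONS)
    return (
        "You are given a synthesized singing clip and its lyrics.\n"
        f"Lyrics: {lyrics}\n"
        f"Critique the clip along these dimensions: {dims}. "
        "For each dimension, give a short comment and a 0-1 score."
    )

if __name__ == "__main__":
    print(build_prompt("Twinkle, twinkle, little star"))
    # The critic's reply could then be parsed into structured feedback,
    # e.g. one record per dimension:
    example = DimensionalCritique(
        "melody", 0.8, "Pitch is stable, but phrase endings drift flat."
    )
    print(example)
```

Structuring the output per dimension, rather than collapsing it to a single scalar, is what lets the feedback remain interpretable and usable as a training signal for the generator.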