YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.

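The abstract names an energy-balanced rectified flow matching loss but does not define it. The sketch below is one possible reading, not the authors' formulation. The rectified flow part is the standard straight-line interpolant between noise and data with velocity target `x1 - x0`. The energy balancing is an assumption made here for illustration: each mel bin is weighted by the inverse of its mean energy, so that low-energy high-frequency bins contribute comparably to the loss. The function names, the `(batch, n_mels, frames)` layout, and the weighting rule are all hypothetical.

```python
import numpy as np

def rectified_flow_pair(x1, rng):
    """Build a rectified-flow training pair for data x1 of shape (batch, n_mels, frames)."""
    x0 = rng.standard_normal(x1.shape)          # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1, 1))   # one time value per batch item
    xt = (1.0 - t) * x0 + t * x1                # straight-line interpolant
    target_v = x1 - x0                          # constant velocity along that line
    return xt, t, target_v

def energy_balanced_weights(x1, eps=1e-6):
    """ASSUMPTION: weight each mel bin by inverse mean energy, normalized to mean 1."""
    energy = np.mean(x1 ** 2, axis=(0, 2))      # shape (n_mels,)
    w = 1.0 / (energy + eps)
    return w / w.mean()

def weighted_fm_loss(pred_v, target_v, weights):
    """Mean squared velocity error, with a per-mel-bin weight."""
    sq_err = (pred_v - target_v) ** 2           # (batch, n_mels, frames)
    return float(np.mean(sq_err * weights[None, :, None]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "mel" batch whose energy decays toward the higher bins.
    scale = np.array([4.0, 2.0, 1.0, 0.5])[None, :, None]
    x1 = rng.standard_normal((2, 4, 8)) * scale
    xt, t, target_v = rectified_flow_pair(x1, rng)
    w = energy_balanced_weights(x1)
    # Stand-in for a model prediction: an all-zero velocity field.
    print(weighted_fm_loss(np.zeros_like(target_v), target_v, w))
```

In a real system the prediction would come from the velocity network evaluated at `(xt, t)` with conditioning on content, F0, and timbre; the all-zero prediction here only exercises the loss.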