Zero-shot voice conversion (VC) aims to transfer the timbre of any unseen target speaker to source speech while preserving its linguistic content. Growing application scenarios demand streaming inference, creating a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter counts to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of the AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by mapping directly from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
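The mean-flow idea referenced in the abstract can be sketched in standard flow-matching notation (the symbols $v$, $u$, and $z_t$ are assumptions for illustration; the abstract itself does not define them). Instead of the instantaneous velocity $v(z_t, t)$, the network regresses the average velocity over an interval $[r, t]$ of the flow trajectory:

```latex
% Average velocity over [r, t] (assumed notation, not defined in the abstract):
\[
u(z_t, r, t) \;\triangleq\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, \mathrm{d}\tau .
\]
% With the full interval [0, 1], one sampling step maps the trajectory start
% (noise, z_1) directly to its endpoint (data, z_0):
\[
z_0 \;=\; z_1 - u(z_1, 0, 1).
\]
```

This is what enables the single-step sampling claimed above: because $u$ averages the velocity field over the whole interval, one evaluation of the network replaces the many small integration steps an ordinary flow-matching sampler would take.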