This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, through rank analysis, we compare the rank characteristics of the Mel-spectrum with those of other common acoustic degradation factors, and cast the vocoder task as a special case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. On this basis, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and the target spectrum as the two endpoints of the stochastic generation trajectory. Further, to fully exploit the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, in which subbands are divided unevenly and a convolutional-style attention module with large kernels is employed for efficient T-F contextual modeling. To enable single-step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and performance is further improved by combining target-related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out-of-distribution datasets. Quantitative and qualitative results show that, with fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC achieves state-of-the-art performance over existing advanced GAN-, DDPM-, and flow-matching-based baselines with only 4 sampling steps. Consistent superiority is still achieved with single-step inference.
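The rank argument in the abstract can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the abstract does not define the RSS surrogate, so the code assumes it is the pseudo-inverse projection of the Mel-spectrum back onto the linear-frequency axis, and it uses a hand-built triangular filterbank and a random non-negative matrix in place of a real spectrogram. The function names and all sizes (40 Mel bands, 257 frequency bins, 16 kHz) are hypothetical choices, not values from the paper.

```python
import numpy as np


def hz_to_mel(f):
    # Common HTK-style Mel scale formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_bins, sample_rate):
    """Toy triangular Mel filterbank of shape (n_mels, n_bins)."""
    nyquist = sample_rate / 2.0
    bin_freqs = np.linspace(0.0, nyquist, n_bins)
    # n_mels + 2 band edges, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def numerical_rank(mat, rel_tol=1e-8):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))


rng = np.random.default_rng(0)
n_mels, n_bins, n_frames = 40, 257, 100  # hypothetical sizes

A = mel_filterbank(n_mels, n_bins, sample_rate=16000)
# Stand-in for a target magnitude spectrum (assumption: random, non-negative).
X = np.abs(rng.standard_normal((n_bins, n_frames)))

mel = A @ X                      # Mel-spectrum: a rank-deficient observation of X
rss = np.linalg.pinv(A) @ mel    # assumed RSS surrogate: range-space part of X

rank_x = numerical_rank(X)
rank_rss = numerical_rank(rss)
print("rank of target spectrum:", rank_x)
print("rank of RSS surrogate:  ", rank_rss)
print("Mel-consistent:", np.allclose(A @ rss, mel, rtol=1e-6, atol=1e-8))
```

Under these assumptions the surrogate lives in the same linear-frequency domain as the target and reproduces the same Mel-spectrum, but its rank cannot exceed the number of Mel bands. That is the sense in which it can be treated as a degraded version of the target and used as the starting endpoint of a restoration-style trajectory.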