Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose MF-Speech, a novel framework consisting of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. MF-SpeechGenerator then functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that on the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER = 4.67%), superior style control (SECS = 0.5685, Corr = 0.68), and the highest subjective evaluation scores (nMOS = 3.96, sMOS_emotion = 3.86, sMOS_style = 3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their potential as a general-purpose speech representation.
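The abstract does not define HSAN's internals, but the general idea of style-adaptive normalization can be sketched: normalize content features, then re-modulate them with a scale and shift predicted from a style embedding (in the spirit of AdaIN/FiLM). The function name, shapes, and linear heads below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def style_adaptive_norm(x, style, w_gamma, w_beta, eps=1e-5):
    """Illustrative style-adaptive normalization (AdaIN/FiLM-style sketch).

    x:        (T, C) content feature sequence for one utterance
    style:    (S,)   style embedding (e.g., a timbre or emotion factor)
    w_gamma:  (S, C) hypothetical linear head predicting the scale
    w_beta:   (S, C) hypothetical linear head predicting the shift
    """
    # Layer-normalize each frame over its channel dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Re-inject style: a style-conditioned scale and shift per channel.
    gamma = style @ w_gamma
    beta = style @ w_beta
    return x_hat * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 16))          # 50 frames, 16 channels
style = rng.normal(size=(8,))          # 8-dim style factor
out = style_adaptive_norm(x, style,
                          rng.normal(size=(8, 16)),
                          rng.normal(size=(8, 16)))
```

A "hierarchical" variant would presumably apply such modulation at multiple network depths with different style granularities; that detail is left to the full paper.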