Controllable text-to-speech (TTS) systems struggle to manipulate speaker timbre and speaking style independently, because the two attributes are typically entangled. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared embedding space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control at inference, we introduce chained classifier-free guidance (cCFG): the model is trained with hierarchical condition dropout, which enables independent adjustment of the guidance strengths for content, timbre, and style. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
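For concreteness, the sketch below shows one way a chained guidance rule with per-attribute strengths can be composed at sampling time. It is a minimal illustration, not the released implementation: the chain order (content, then timbre, then style), the `dit` callable, and its keyword interface are assumptions for exposition.

```python
def ccfg_velocity(dit, x_t, t, content, timbre, style,
                  w_content=3.0, w_timbre=1.5, w_style=2.0):
    """Chained classifier-free guidance: combine four forward passes so
    each weight scales exactly one attribute's contribution.

    `dit` is assumed to be a denoiser callable that accepts optional
    condition keywords; hierarchical condition dropout at training time
    is what makes the partially-conditioned passes well-defined here.
    """
    v_null = dit(x_t, t)                                   # unconditional
    v_c    = dit(x_t, t, content=content)                  # + content
    v_ct   = dit(x_t, t, content=content, timbre=timbre)   # + timbre
    v_cts  = dit(x_t, t, content=content, timbre=timbre,
                 style=style)                              # + style
    return (v_null
            + w_content * (v_c - v_null)     # content guidance term
            + w_timbre  * (v_ct - v_c)       # timbre guidance term
            + w_style   * (v_cts - v_ct))    # style guidance term
```

Because each bracketed difference isolates one attribute added on top of the previous conditions, raising `w_style` strengthens style adherence without re-weighting content or timbre.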
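A REPA-style regularizer can likewise be stated compactly. The following is a rough sketch under stated assumptions: the target features come from a frozen Whisper encoder, are pre-resampled to the latent frame rate, and `proj` is a learned projection from a chosen intermediate DiT block; the layer choice and projection are illustrative, not the paper's exact configuration.

```python
import torch.nn.functional as F

def repa_loss(dit_hidden, whisper_feats, proj):
    """Negative cosine similarity between projected intermediate DiT
    states and frozen Whisper encoder features, averaged over frames.

    dit_hidden:    [B, T, D_dit]      intermediate DiT representations
    whisper_feats: [B, T, D_whisper]  pre-extracted teacher features,
                                      assumed aligned to T latent frames
    """
    pred = F.normalize(proj(dit_hidden), dim=-1)        # [B, T, D_whisper]
    target = F.normalize(whisper_feats.detach(), dim=-1)  # no teacher grads
    return -(pred * target).sum(dim=-1).mean()
```

Added to the diffusion objective with a small weight, such a term pulls intermediate representations toward acoustic-semantic structure the teacher already encodes, which is the stabilization and convergence effect the abstract attributes to REPA.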