This work proposes GLM-TTS, a production-grade TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental-frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app and web client (chatglm.cn).
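The two-stage design above can be sketched as a minimal pipeline: an autoregressive model maps text to a sequence of discrete speech tokens, and a diffusion decoder maps those tokens to waveform samples. All class names, token rates, and sample rates below are illustrative placeholders and not the actual GLM-TTS API; the toy models just produce dummy outputs with the right shapes.

```python
import random


class TextToTokenAR:
    """Toy stand-in for the stage-1 autoregressive text-to-token model."""

    def __init__(self, codebook_size=8192, tokens_per_char=2, seed=0):
        self.codebook_size = codebook_size      # size of the discrete speech-token vocabulary (assumed)
        self.tokens_per_char = tokens_per_char  # crude proxy for the text-to-token length ratio
        self.rng = random.Random(seed)

    def generate(self, text):
        # A real AR model samples one token per decoding step,
        # conditioned on the text and previous tokens; we fake it.
        n = len(text) * self.tokens_per_char
        return [self.rng.randrange(self.codebook_size) for _ in range(n)]


class TokenToWaveDiffusion:
    """Toy stand-in for the stage-2 token-to-waveform diffusion decoder."""

    def __init__(self, samples_per_token=480):  # e.g. 24 kHz audio at 50 tokens/s (assumed)
        self.samples_per_token = samples_per_token

    def synthesize(self, tokens):
        # A real decoder runs iterative denoising conditioned on the
        # speech tokens; here we simply emit silence of the right length.
        return [0.0] * (len(tokens) * self.samples_per_token)


def tts(text):
    """Run the two stages end to end: text -> tokens -> waveform."""
    tokens = TextToTokenAR().generate(text)
    return TokenToWaveDiffusion().synthesize(tokens)


wave = tts("hello")
print(len(wave))  # 5 chars * 2 tokens/char * 480 samples/token = 4800
```

The key property the sketch captures is the clean interface between the stages: the token sequence fully decouples the linguistic model from the acoustic decoder, which is what allows each stage to be optimized (or swapped) independently.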