STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition

Voice communication in bandwidth-constrained environments (maritime, satellite, and tactical networks) remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework that enables natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, and compresses each component with a tailored method: context-aware text encoding (approximately 70 bps), sparse prosody transmission reconstructed by TTS interpolation (less than 14 bps at update rates of 0.1-1 Hz), and an amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS greater than 4.26). We also find that quality is bimodal with respect to the prosody sampling rate: sparse and dense updates both achieve high quality, whereas mid-range rates degrade because of perceptual discontinuities. This finding guides the choice of configuration. Beyond efficiency, the modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low-bandwidth scenarios.
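The bitrate figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constants are the values quoted in the abstract, while the function names, the 1536-bit embedding size, and the 600-second session length are hypothetical choices made for this example and are not taken from the paper.

```python
# Bitrates quoted in the abstract, in bits per second.
STCTS_BPS = 80        # "approximately 80 bps" overall
OPUS_BPS = 6000       # Opus baseline (6 kbps)
ENCODEC_BPS = 1000    # EnCodec baseline (1 kbps)


def reduction_factor(baseline_bps, target_bps=STCTS_BPS):
    """Return how many times smaller the target bitrate is than the baseline."""
    return baseline_bps / target_bps


def amortized_bps(payload_bits, duration_s):
    """Average bitrate of a payload sent once and reused for duration_s seconds."""
    return payload_bits / duration_s


opus_gain = reduction_factor(OPUS_BPS)        # 6000 / 80
encodec_gain = reduction_factor(ENCODEC_BPS)  # 1000 / 80

# Hypothetical example: a 1536-bit speaker embedding reused over a
# 600-second session. Neither number comes from the paper.
embedding_cost = amortized_bps(1536, 600)

print(f"vs Opus:    {opus_gain:.1f}x")
print(f"vs EnCodec: {encodec_gain:.1f}x")
print(f"amortized embedding: {embedding_cost:.2f} bps")
```

At the nominal 80 bps the ratios come out to 75 and 12.5, consistent with the reported 75x and 12x. The embedding example shows why a one-time payload adds little to the average rate: its cost shrinks in proportion to how long the same speaker keeps talking.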
