Voice communication in bandwidth-constrained environments (maritime, satellite, and tactical networks) remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework that enables natural voice communication at 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, and compresses each with a tailored method: context-aware text encoding (70 bps), sparse prosody transmission via TTS interpolation (<14 bps at 0.1-1 Hz), and an amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS > 4.26), degrading gracefully under packet loss, and remaining resilient to noise. We also find that quality is bimodal in the prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade because of perceptual discontinuities, a finding that guides configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low-bandwidth scenarios.
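The reduction factors quoted above follow directly from the stated bitrates, and the prosody budget can be read as a per-update bit allowance. The sketch below only reworks the figures given in the abstract; the function names are illustrative, and the per-update allowance is a derived upper bound, not a number reported by the authors.

```python
# Bitrates quoted in the abstract, in bits per second.
STCTS_BPS = 80
OPUS_BPS = 6_000
ENCODEC_BPS = 1_000
PROSODY_BUDGET_BPS = 14  # stated upper bound for the prosody stream


def reduction_factor(baseline_bps: float, target_bps: float) -> float:
    """How many times smaller the target bitrate is than the baseline."""
    return baseline_bps / target_bps


def max_bits_per_update(budget_bps: float, update_rate_hz: float) -> float:
    """Largest prosody update that fits the budget at a given update rate.

    Illustrative helper (an assumption, not from the paper): a stream of
    update_rate_hz updates per second, each of b bits, costs
    b * update_rate_hz bits per second.
    """
    return budget_bps / update_rate_hz


if __name__ == "__main__":
    print("vs Opus:", reduction_factor(OPUS_BPS, STCTS_BPS))
    print("vs EnCodec:", reduction_factor(ENCODEC_BPS, STCTS_BPS))
    for rate_hz in (0.1, 1.0):
        bits = max_bits_per_update(PROSODY_BUDGET_BPS, rate_hz)
        print(f"{rate_hz} Hz -> at most {bits:.0f} bits per prosody update")
```

Under these figures the EnCodec ratio works out to 12.5, which the abstract reports as 12x, and a sparse 0.1 Hz prosody stream leaves room for roughly ten times more bits per update than a 1 Hz stream within the same budget.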