Voice communication in bandwidth-constrained environments such as maritime, satellite, and tactical networks remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework that enables natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, and compresses each component separately: context-aware text encoding (approximately 70 bps), sparse prosody transmission with TTS interpolation (less than 14 bps at update rates of 0.1-1 Hz), and an amortized speaker embedding. Evaluations on LibriSpeech show a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS greater than 4.26). We also find that quality is bimodal in the prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade because of perceptual discontinuities, a result that guides configuration design. Beyond efficiency, the modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low-bandwidth scenarios.
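The bitrate figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the component rates and baseline bitrates are the approximate values stated in the abstract, while the speaker-embedding size and session length are hypothetical placeholders chosen to show how amortization works, not values reported for STCTS.

```python
# Approximate bitrates quoted in the abstract, in bits per second.
TEXT_BPS = 70.0          # context-aware text encoding
PROSODY_BPS_MAX = 14.0   # upper bound for the sparse prosody stream
TOTAL_BPS = 80.0         # approximate overall STCTS bitrate
OPUS_BPS = 6000.0        # Opus baseline (6 kbps)
ENCODEC_BPS = 1000.0     # EnCodec baseline (1 kbps)


def reduction_factor(baseline_bps: float, target_bps: float) -> float:
    """Return how many times smaller target_bps is than baseline_bps."""
    return baseline_bps / target_bps


def amortized_bps(payload_bits: float, session_seconds: float) -> float:
    """Average bitrate of a one-time payload spread over a whole session."""
    return payload_bits / session_seconds


if __name__ == "__main__":
    # Text plus the prosody upper bound brackets the ~80 bps total.
    print("text + prosody bound:", TEXT_BPS + PROSODY_BPS_MAX, "bps")
    print("vs Opus:   ", reduction_factor(OPUS_BPS, TOTAL_BPS), "x")
    print("vs EnCodec:", reduction_factor(ENCODEC_BPS, TOTAL_BPS), "x")

    # Hypothetical: a 256-dimensional embedding at 16 bits per value,
    # sent once and amortized over a 10-minute call.
    embedding_bits = 256 * 16
    print("speaker embedding:", amortized_bps(embedding_bits, 600.0), "bps")
```

The one-time speaker embedding contributes less to the average rate the longer the session runs, which is why it can be treated as amortized rather than as a steady per-second cost.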