Semantic communications for multi-modal data can transmit task-relevant information efficiently over noisy and bandwidth-limited channels. However, a key challenge is to simultaneously compress inter-modal redundancy and improve semantic reliability under channel distortion. To address this challenge, we propose a robust and efficient multi-modal task-oriented communication framework that integrates a two-stage variational information bottleneck (VIB) with mutual information (MI) redundancy minimization. In the first stage, we apply a uni-modal VIB to compress each modality (text, audio, and video) separately while preserving task-specific features. To enhance efficiency, an MI minimization module with adversarial training is then used to suppress cross-modal dependencies and to promote complementarity rather than redundancy. In the second stage, a multi-modal VIB further compresses the fused representation and enhances robustness against channel distortion. Experimental results on multi-modal emotion recognition tasks demonstrate that the proposed framework significantly outperforms existing baselines in accuracy and reliability, particularly in low signal-to-noise ratio (SNR) regimes. Our work provides a principled framework that jointly optimizes modality-specific compression, inter-modal redundancy, and communication reliability.
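The two-stage VIB above builds on the standard Gaussian variational bottleneck. As a minimal illustration (not the paper's implementation; the function names are hypothetical), the two core pieces are reparameterized sampling of the latent representation and the KL compression term of the VIB objective, sketched here in NumPy:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow through the stochastic bottleneck.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent
    # dimensions and averaged over the batch -- the compression term
    # that upper-bounds I(X; Z) in the VIB objective.
    kl_per_dim = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)
    return kl_per_dim.sum(axis=-1).mean()
```

In a full system, each uni-modal encoder would output `mu` and `log_var` for its modality, and the task loss would be traded off against this KL term by a Lagrange multiplier.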