As posts on social media grow rapidly, analyzing the sentiment embedded in image-text pairs has become a popular research topic in recent years. Although existing works have achieved impressive results in jointly harnessing image and text information, they overlook possible low-quality and missing modalities. In real-world applications, these issues occur frequently, creating an urgent need for models that can predict sentiment robustly. We therefore propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate its feature distribution, which allows us to handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contribution to the fusion by quantitatively estimating modality quality from the distributions. For missing modalities, we build inter-modal mappings supervised by both samples and distributions, thereby recovering the missing modalities from the available ones. In experiments, two disruption strategies that corrupt or discard some modalities in samples are adopted to mimic the low-quality and missing modalities arising in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate consistent improvements of DRF over state-of-the-art methods under both strategies, validating its effectiveness for robust multimodal sentiment analysis.
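To make the distribution-based quality estimation and quality-weighted fusion concrete, here is a minimal sketch in plain Python. All names (`ModalityQueue`, `fuse`) and the z-score-based quality heuristic are illustrative assumptions for exposition, not the paper's actual formulation, which operates on learned deep features.

```python
import statistics

class ModalityQueue:
    """Fixed-size queue of feature vectors approximating one modality's
    feature distribution (an illustrative stand-in for DRF's queues)."""

    def __init__(self, maxlen=256):
        self.maxlen = maxlen
        self.feats = []

    def push(self, feat):
        # Enqueue a feature; evict the oldest once the queue is full.
        self.feats.append(list(feat))
        if len(self.feats) > self.maxlen:
            self.feats.pop(0)

    def stats(self):
        # Per-dimension mean and std over the queued features.
        dims = list(zip(*self.feats))
        means = [statistics.fmean(d) for d in dims]
        stds = [statistics.pstdev(d) or 1e-6 for d in dims]  # avoid div by 0
        return means, stds

    def quality(self, feat):
        # Heuristic quality score in (0, 1]: close to 1 when `feat` lies
        # near the queue's distribution, smaller the further it deviates.
        means, stds = self.stats()
        z = statistics.fmean(
            abs(f - m) / s for f, m, s in zip(feat, means, stds)
        )
        return 1.0 / (1.0 + z)

def fuse(img_feat, txt_feat, img_quality, txt_quality):
    """Quality-weighted fusion: a low-quality modality contributes less."""
    total = img_quality + txt_quality
    wi, wt = img_quality / total, txt_quality / total
    return [wi * a + wt * b for a, b in zip(img_feat, txt_feat)]
```

Under this sketch, a corrupted image feature far from the image queue's distribution receives a low quality score, so the fused representation is dominated by the text feature; a missing modality would instead be recovered by a learned mapping from the available one before fusion.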