Multimodal recommendation systems utilize various types of information, including images and text, to enhance the effectiveness of recommendations. The key challenge is predicting user purchasing behavior from the available data. Current recommendation models prioritize extracting multimodal information while neglecting the distinction between redundant and valuable data. They also rely heavily on a single semantic framework (e.g., local or global semantics), resulting in an incomplete or biased representation of user preferences, particularly preferences that are under-expressed in prior interactions. Furthermore, these approaches fail to capture the complex interactions between users and items, limiting the model's ability to meet the needs of diverse users. To address these challenges, we present SRGFormer, a structurally optimized multimodal recommendation model. By modifying the transformer to integrate more effectively into our model, we capture users' overall behavior patterns. We then enhance structural information by embedding multimodal information into a hypergraph structure, which aids in learning the local structures between users and items. Meanwhile, applying self-supervised tasks to user-item collaborative signals strengthens the integration of multimodal information, thereby revealing the representational features inherent to each modality. Extensive experiments on three public datasets show that SRGFormer surpasses previous benchmark models, achieving an average performance improvement of 4.47% on the Sports dataset. The code is publicly available online.