The rapid expansion of online fashion platforms has created a growing demand for intelligent recommender systems capable of understanding both visual and textual cues. This paper proposes a hybrid multimodal deep learning framework for fashion recommendation that jointly addresses two key tasks: outfit compatibility prediction and complementary item retrieval. The model leverages the visual and textual encoders of the CLIP architecture to obtain joint latent representations of fashion items, which are fused into a unified feature vector and processed by a transformer encoder. For compatibility prediction, an "outfit token" is introduced to model the holistic relationships among items, achieving an AUC of 0.95 on the Polyvore dataset. For complementary item retrieval, a "target item token" representing the desired item description is used to retrieve compatible items, reaching an accuracy of 69.24% on the Fill-in-the-Blank (FITB) task. The proposed approach demonstrates strong performance on both tasks, highlighting the effectiveness of multimodal learning for fashion recommendation.
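To make the described architecture concrete, the following is a minimal PyTorch-style sketch of the compatibility-prediction path: per-item CLIP image and text embeddings are fused into item tokens, a learnable "outfit token" is prepended, and a transformer encoder produces a compatibility score from that token. All module names, dimensions, and the scoring head are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: assumes precomputed CLIP image/text embeddings
# (e.g., 512-d each); hyperparameters and layer choices are placeholders.
from typing import Optional
import torch
import torch.nn as nn

class OutfitCompatibilityModel(nn.Module):
    def __init__(self, clip_dim: int = 512, d_model: int = 512,
                 nhead: int = 8, num_layers: int = 4):
        super().__init__()
        # Fuse each item's visual and textual CLIP features into one token.
        self.item_proj = nn.Linear(2 * clip_dim, d_model)
        # Learnable "outfit token" prepended to the item sequence; its output
        # state summarizes the whole outfit for compatibility prediction.
        self.outfit_token = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score_head = nn.Linear(d_model, 1)  # compatibility logit

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor,
                padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # img_emb, txt_emb: (batch, num_items, clip_dim)
        items = self.item_proj(torch.cat([img_emb, txt_emb], dim=-1))
        outfit = self.outfit_token.expand(items.size(0), -1, -1)
        tokens = torch.cat([outfit, items], dim=1)
        if padding_mask is not None:
            # Never mask out the outfit token itself.
            pad = torch.zeros(items.size(0), 1, dtype=torch.bool,
                              device=padding_mask.device)
            padding_mask = torch.cat([pad, padding_mask], dim=1)
        encoded = self.encoder(tokens, src_key_padding_mask=padding_mask)
        return self.score_head(encoded[:, 0])  # logit read from the outfit token

# Usage sketch: score a batch of 2 outfits with 4 items each.
model = OutfitCompatibilityModel()
img = torch.randn(2, 4, 512)
txt = torch.randn(2, 4, 512)
probs = torch.sigmoid(model(img, txt))  # shape (2, 1), compatibility probabilities
```

The complementary-item retrieval task would follow the same pattern, with a "target item token" (e.g., encoded from the desired item's description) appended to the sequence and its output representation used to rank candidate items, as described in the abstract.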