The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, existing synthetic data generation methods focus primarily on increasing data volume, an emphasis that often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves 64.1% Recall@1 on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
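To make the role-playing mechanism concrete, the sketch below illustrates how role-conditioned prompts might be constructed and dispatched to an MLLM to produce several viewpoint-specific captions for the same image. This is a minimal illustration, not the released Role-SynthCLIP pipeline: the role wordings, the `build_role_prompt` helper, and the stubbed `query_mllm` call are assumptions introduced here for clarity.

```python
# Minimal sketch (assumed, not the authors' released code) of multi-perspective
# role-playing prompts guiding an MLLM to caption one image from distinct viewpoints.

ROLES = {
    # Role names follow the examples mentioned in the abstract; the exact
    # wording of each system prompt is an illustrative assumption.
    "compositional_analyst": (
        "You are a compositional analyst. Describe the spatial layout and the "
        "relationships between the objects in the image."
    ),
    "context_interpreter": (
        "You are an interpreter of image context. Describe the scene, setting, "
        "and the situation the image most likely depicts."
    ),
}


def build_role_prompt(role_key: str) -> str:
    """Return the role-conditioned instruction used to query the MLLM."""
    return ROLES[role_key] + " Write one accurate, detailed caption."


def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a multimodal model call.

    In practice this would send the image and the role prompt to an MLLM;
    here it returns a stub string so the sketch stays self-contained.
    """
    return f"[caption of {image_path} from prompt: {prompt[:40]}...]"


def synthesize_captions(image_path: str) -> list[tuple[str, str]]:
    """Produce one (image, caption) pair per role for a single image."""
    return [
        (image_path, query_mllm(image_path, build_role_prompt(role)))
        for role in ROLES
    ]


if __name__ == "__main__":
    for image, caption in synthesize_captions("example.jpg"):
        print(image, "->", caption)
```

In this reading, diversity comes from varying the role prompt rather than from adding more images, which is consistent with the abstract's claim that the total number of image-text pairs stays unchanged while caption expressiveness improves.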