CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose $\beta$-CLIP, a multi-granular text-conditioned contrastive learning framework designed to achieve hierarchical alignment between multiple textual granularities (from full captions to sentences and phrases) and their corresponding visual regions. For each level of granularity, $\beta$-CLIP uses cross-attention to dynamically pool image patches, producing contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the $\beta$-Contextualized Contrastive Alignment Loss ($\beta$-CAL). This objective parameterizes the trade-off between strict query-specific matching and relaxed intra-image contextualization, and supports both soft Cross-Entropy and hard Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that $\beta$-CLIP significantly improves dense alignment, achieving 91.8% T2I and 92.3% I2T R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting the state of the art among methods trained without hard negatives. $\beta$-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at https://github.com/fzohra/B-CLIP.
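To make the two components named above concrete, the sketch below illustrates (i) text-conditioned cross-attention pooling of image patches and (ii) a $\beta$-interpolated soft Cross-Entropy objective. It is a minimal PyTorch sketch of the general idea only, not the released implementation: the abstract does not specify the exact $\beta$-CAL formulation, so the function names (`cross_attention_pool`, `beta_cal_soft_ce`), the number of queries per image `Q`, the temperature, and the way $\beta$ interpolates the targets are all illustrative assumptions.

```python
# Minimal sketch of the ideas described in the abstract; NOT the authors' code.
# Assumptions (not stated in the abstract): Q text queries per image (caption /
# sentences / phrases flattened into one level), a single cross-attention pooling
# step, and a beta-interpolated soft target for the Cross-Entropy variant.
import torch
import torch.nn.functional as F

def cross_attention_pool(text_q, patches):
    """Pool image patches with text-conditioned cross-attention.

    text_q:  (B, Q, D) text query embeddings at one granularity level
    patches: (B, P, D) image patch embeddings
    returns: (B, Q, D) contextualized visual embeddings, one per text query
    """
    d = text_q.size(-1)
    attn = torch.softmax(text_q @ patches.transpose(1, 2) / d ** 0.5, dim=-1)  # (B, Q, P)
    return attn @ patches  # (B, Q, D)

def beta_cal_soft_ce(text_q, vis_q, beta, temperature=0.07):
    """Hypothetical soft Cross-Entropy form of a beta-contextualized loss.

    Each text query is contrasted against every pooled visual embedding in the
    batch. The target interpolates between a one-hot match with its own pooled
    embedding (beta -> 1: strict query-specific matching) and a uniform
    distribution over embeddings from the same image (beta -> 0: relaxed
    intra-image contextualization).
    """
    B, Q, D = text_q.shape
    t = F.normalize(text_q.reshape(B * Q, D), dim=-1)
    v = F.normalize(vis_q.reshape(B * Q, D), dim=-1)
    logits = t @ v.t() / temperature  # (B*Q, B*Q) text-to-visual similarities

    idx = torch.arange(B * Q, device=logits.device)
    same_image = (idx.unsqueeze(0) // Q) == (idx.unsqueeze(1) // Q)  # (B*Q, B*Q)
    one_hot = torch.eye(B * Q, device=logits.device)
    intra = same_image.float() / Q                    # uniform over same-image queries
    targets = beta * one_hot + (1.0 - beta) * intra   # each row sums to 1

    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```

Under these assumptions, a hard Binary Cross-Entropy variant would instead treat the diagonal (and, depending on $\beta$, other same-image pairs) as binary positives rather than a soft target distribution.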