Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, textual queries often fail to accurately and comprehensively reflect image content, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and images further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose Generation-Enhanced Alignment (GEA), which tackles the problem from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between textual and visual patterns; these generated images enrich the semantic representation of the text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which applies cross-attention among generated-image, original-image, and text features to produce a unified representation optimized by a triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results demonstrate the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
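The GIF module described above can be illustrated with a minimal sketch: text tokens first attend to diffusion-generated image tokens (the intermediate representation from TGTE), the result then attends to original image tokens, and the pooled output is trained with a standard triplet margin loss. All function names, dimensions, and the single-head attention form below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cross_attention(q, k, v):
    # Scaled dot-product attention: q (Lq, d), k/v (Lk, d) -> (Lq, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def gif_fuse(text_tok, gen_tok, img_tok):
    """Hypothetical GIF-style fusion: text attends to generated-image
    tokens, then the bridged features attend to original-image tokens;
    mean pooling yields one unified embedding of shape (d,)."""
    bridged = cross_attention(text_tok, gen_tok, gen_tok)
    fused = cross_attention(bridged, img_tok, img_tok)
    return fused.mean(axis=0)

def triplet_alignment_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet margin loss on L2-normalized embeddings."""
    def sq_dist(a, b):
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return float(np.sum((a - b) ** 2))
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

rng = np.random.default_rng(0)
text = rng.standard_normal((16, 64))  # text tokens
gen = rng.standard_normal((32, 64))   # diffusion-generated image tokens
img = rng.standard_normal((32, 64))   # original image tokens
unified = gif_fuse(text, gen, img)    # unified representation, shape (64,)
```

In this sketch the generated-image tokens act as the bridge the abstract describes: the text never attends to the real image directly, so the diffusion output mediates the cross-modal alignment.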