Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS and can even degrade performance, while simple random masking significantly enhances RIS performance. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach improves the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
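The abstract describes masking both the image and the referring text as a data augmentation; the implementation details live in the linked repository. As a minimal illustrative sketch, assuming patch-level image masking and word-level text masking (all function names, mask ratios, and the `[MASK]` token here are illustrative, not the authors' exact implementation):

```python
import random

import numpy as np

def mask_image_patches(image, patch_size=16, mask_ratio=0.5, rng=None):
    """Zero out a random subset of square patches in an (H, W, C) image."""
    rng = rng or random.Random(0)
    h, w = image.shape[:2]
    masked = image.copy()
    # Enumerate top-left corners of non-overlapping patches.
    patches = [(y, x) for y in range(0, h, patch_size)
                      for x in range(0, w, patch_size)]
    for y, x in rng.sample(patches, int(len(patches) * mask_ratio)):
        masked[y:y + patch_size, x:x + patch_size] = 0
    return masked

def mask_text_words(text, mask_ratio=0.3, mask_token="[MASK]", rng=None):
    """Replace a random subset of words in the referring expression."""
    rng = rng or random.Random(0)
    words = text.split()
    for i in rng.sample(range(len(words)), int(len(words) * mask_ratio)):
        words[i] = mask_token
    return " ".join(words)
```

Training on such partially masked inputs is what is hypothesized to make the model robust to occluded objects and incomplete or ambiguous descriptions at test time.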