In this paper, we propose a novel task termed Omni-Referring Image Segmentation (OmniRIS) towards highly generalized image segmentation. Compared with existing unimodally conditioned segmentation tasks, such as RIS and visual RIS, OmniRIS supports text instructions and reference images with masks, boxes, or scribbles as omni-prompts. This property allows it to exploit the intrinsic merits of both the text and visual modalities, i.e., granular attribute referring and uncommon object grounding, respectively. Besides, OmniRIS can also handle various segmentation settings, such as one-vs.-many and many-vs.-many, further facilitating its practical use. To promote research on OmniRIS, we rigorously design and construct a large dataset termed OmniRef, which consists of 186,939 omni-prompts for 30,956 images, and establish a comprehensive evaluation system. Moreover, we propose a strong and general baseline termed OmniSegNet to tackle the key challenges of OmniRIS, such as omni-prompt encoding. Extensive experiments not only validate the capability of OmniSegNet in following omni-modal instructions, but also demonstrate the superiority of OmniRIS for highly generalized image segmentation.
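To make the notion of an omni-prompt concrete, below is a minimal, hypothetical sketch of how such an input might be represented. All names here (OmniPrompt, VisualReference, and the field layout) are illustrative assumptions, not the actual OmniRIS or OmniSegNet interface; the sketch only mirrors the abstract's description that a query may combine an optional text instruction with reference images annotated by a mask, a box, or a scribble.

```python
# Hypothetical sketch of an omni-prompt container; not the paper's actual API.
from dataclasses import dataclass, field
from typing import List, Literal, Optional

import numpy as np


@dataclass
class VisualReference:
    """A reference image plus one region annotation (mask, box, or scribble)."""
    image: np.ndarray                      # H x W x 3 reference image
    annotation_type: Literal["mask", "box", "scribble"]
    mask: Optional[np.ndarray] = None      # H x W binary mask, if annotation_type == "mask"
    box: Optional[tuple] = None            # (x1, y1, x2, y2), if annotation_type == "box"
    scribble: Optional[np.ndarray] = None  # N x 2 point sequence, if "scribble"


@dataclass
class OmniPrompt:
    """An optional text instruction plus zero or more visual references.

    Either modality may be absent, covering text-only (RIS-style),
    visual-only (visual-RIS-style), and joint omni-modal queries.
    """
    text: Optional[str] = None
    visual_refs: List[VisualReference] = field(default_factory=list)


# Example: ground an uncommon object via a visual exemplar, refined by a
# granular textual attribute -- the two merits the abstract attributes to
# the visual and text modalities, respectively.
prompt = OmniPrompt(
    text="the smaller one on the left",
    visual_refs=[
        VisualReference(
            image=np.zeros((480, 640, 3), dtype=np.uint8),  # placeholder image
            annotation_type="box",
            box=(120, 80, 300, 260),
        )
    ],
)
```

Under this reading, a one-vs.-many or many-vs.-many setting simply varies how many such prompts and target objects are involved per query; the container itself stays the same.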