Self-supervised learning (SSL) holds promise in leveraging large amounts of unlabeled data. However, the success of popular SSL methods has been limited to single-centric-object images like those in ImageNet, and these methods ignore the correlation between the scene and its instances, as well as the semantic differences among instances in the scene. To address the above problems, we propose Unified Self-supervised Visual Pre-training (UniVIP), a novel self-supervised framework that learns versatile visual representations on either single-centric-object or non-iconic datasets. The framework takes into account representation learning at three levels: 1) the similarity of scene-scene, 2) the correlation of scene-instance, and 3) the discrimination of instance-instance. During learning, we adopt the optimal transport algorithm to automatically measure the discrimination of instances. Extensive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance on a variety of downstream tasks, such as image classification, semi-supervised learning, object detection, and segmentation. Furthermore, our method can also exploit single-centric-object datasets such as ImageNet: it outperforms BYOL by 2.5% in linear probing with the same pre-training epochs, and surpasses current self-supervised object detection methods on the COCO dataset, demonstrating its universality and potential.
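The abstract mentions using optimal transport to measure instance discrimination but does not spell out the solver. A common choice for entropic optimal transport between two sets of instance embeddings is the Sinkhorn algorithm; the sketch below is an illustrative, minimal implementation under that assumption, not UniVIP's actual formulation (the function name `sinkhorn`, the uniform marginals, and the toy embeddings are all hypothetical).

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=100):
    """Entropic optimal transport via Sinkhorn iterations.

    cost: (m, n) pairwise cost matrix between two sets of instance
    embeddings (e.g. 1 - cosine similarity between L2-normalized
    features). Returns a transport plan whose rows and columns sum
    (approximately) to uniform marginals 1/m and 1/n.
    """
    m, n = cost.shape
    K = np.exp(-cost / eps)            # Gibbs kernel
    r = np.ones(m) / m                 # uniform row marginal
    c = np.ones(n) / n                 # uniform column marginal
    u = np.ones(m) / m                 # row scaling vector
    v = np.ones(n) / n                 # column scaling vector
    for _ in range(n_iters):
        u = r / (K @ v)                # enforce row marginals
        v = c / (K.T @ u)              # enforce column marginals
    return u[:, None] * K * v[None, :]

# Toy example: 4 instance embeddings from each of two augmented views.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(4, 8)); b /= np.linalg.norm(b, axis=1, keepdims=True)
plan = sinkhorn(1.0 - a @ b.T)         # soft matching between instance sets
```

The resulting `plan` is a soft assignment between the two instance sets; large entries indicate instance pairs the transport solver considers well matched, which can serve as a similarity measure for an instance-discrimination objective.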