Although deep-learning-based methods have achieved great success in many computer vision tasks, their performance relies on a large number of densely annotated samples, which are typically difficult to obtain. In this paper, we focus on the problem of learning representations from unlabeled data for semantic segmentation. Inspired by two patch-based methods, we develop a novel self-supervised learning framework by formulating the Jigsaw Puzzle problem as a patch-wise classification process and solving it with a fully convolutional network. By learning to solve a Jigsaw Puzzle problem with 25 patches and transferring the learned features to the semantic segmentation task on the Cityscapes dataset, we achieve a 5.8 percentage point improvement over a baseline model initialized from random values. Moreover, experiments show that our self-supervised learning method can be applied to different datasets and models. In particular, we achieve performance competitive with state-of-the-art methods on the PASCAL VOC2012 dataset using significantly fewer training images.
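As an illustration of the patch-wise formulation, the following minimal sketch builds one training pair for a 25-patch puzzle: a shuffled image and a 5 x 5 label map giving, for each puzzle position, the index of the original patch placed there. It is an assumption-laden sketch, not the paper's implementation: the function name `make_jigsaw`, the row-major patch indexing, the uniform random permutation, and the cropping of leftover border pixels are all hypothetical choices made for illustration.

```python
import numpy as np


def make_jigsaw(image, grid=5, rng=None):
    """Shuffle a grid x grid tiling of `image` (hypothetical helper).

    Returns the shuffled image and a (grid, grid) integer array whose entry
    at [r, c] is the row-major index of the original patch now at (r, c).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    # Assumption: crop so the image divides evenly into patches.
    image = image[: ph * grid, : pw * grid]

    # perm[dst] = src: the patch from position src moves to position dst.
    perm = rng.permutation(grid * grid)
    puzzle = np.empty_like(image)
    for dst, src in enumerate(perm):
        sr, sc = divmod(int(src), grid)
        dr, dc = divmod(dst, grid)
        puzzle[dr * ph:(dr + 1) * ph, dc * pw:(dc + 1) * pw] = \
            image[sr * ph:(sr + 1) * ph, sc * pw:(sc + 1) * pw]

    # One class label (0 .. grid*grid - 1) per patch position.
    labels = perm.reshape(grid, grid)
    return puzzle, labels


if __name__ == "__main__":
    img = np.arange(100 * 100 * 3, dtype=np.int64).reshape(100, 100, 3)
    puzzle, labels = make_jigsaw(img, grid=5, rng=np.random.default_rng(0))
    print(puzzle.shape, labels.shape)
```

Under these assumptions, a fully convolutional network would take `puzzle` as input and be trained with a per-position cross-entropy loss against `labels`, so that each output location predicts which of the 25 original positions its patch came from.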