Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Vision Transformers (ViTs) brought the transformer architecture to vision tasks and achieve impressive performance when trained on large datasets. On relatively small datasets, however, ViTs are less accurate because they lack inductive bias. To address this, we propose a simple yet effective self-supervised learning (SSL) strategy for training ViTs that, without any external annotation, can significantly improve results. Specifically, we define a set of SSL tasks based on relations between image patches, which the model has to solve either before the downstream training or jointly with it. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that correspond to image patches, thus exploiting more training signal at each training step. We evaluated the proposed methods on several image benchmarks and found that RelViT improves on state-of-the-art SSL methods by a large margin, especially on small datasets.
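To make the idea of a patch-relation task concrete, here is a minimal sketch of how annotation-free targets could be derived from the patch grid alone. The abstract does not specify the exact tasks, so the label scheme below (eight neighbor directions plus a "not adjacent" class) and the names `DIRECTIONS`, `NOT_ADJACENT`, and `patch_relation_labels` are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

# Hypothetical label scheme (an assumption, not taken from the paper):
# classes 0-7 are the eight neighbor directions, class 8 means "not adjacent".
DIRECTIONS = {
    (-1, -1): 0, (-1, 0): 1, (-1, 1): 2,
    (0, -1): 3,              (0, 1): 4,
    (1, -1): 5,  (1, 0): 6,  (1, 1): 7,
}
NOT_ADJACENT = 8


def patch_relation_labels(grid_size):
    """Build an N x N matrix of relation labels, with N = grid_size ** 2.

    labels[i][j] encodes where patch j sits relative to patch i on the grid.
    The diagonal (i == j) is set to -1 so that a loss function can skip it.
    """
    # Patch indices follow row-major order: index = row * grid_size + col.
    coords = list(product(range(grid_size), repeat=2))
    labels = []
    for ri, ci in coords:
        row = []
        for rj, cj in coords:
            offset = (rj - ri, cj - ci)
            if offset == (0, 0):
                row.append(-1)  # a patch has no relation to itself
            else:
                row.append(DIRECTIONS.get(offset, NOT_ADJACENT))
        labels.append(row)
    return labels


# Usage: a 3 x 3 grid yields a 9 x 9 label matrix.
labels = patch_relation_labels(3)
print(labels[4])  # relations of every patch to the center patch (index 4)
```

In this sketch every patch token receives a target for every other patch, which illustrates how supervising all patch-related output tokens, rather than a single class token, could provide more training signal per step. The targets come purely from grid geometry, so no external annotation is involved.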
