Vision foundation models achieve strong performance on both global and dense downstream tasks. Pretrained on large images, the recent DINOv3 model family produces very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the quadratic complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach in which the student learns to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. Moreover, it produces feature maps that are very similar to those of the teacher at a fraction of the computational cost. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.
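As a rough illustration of the distillation setup sketched above, the snippet below shows one plausible reading: a frozen teacher backbone is run on the high-resolution image, while the trainable student sees a downsampled copy and an upsampling head maps its coarser features onto the teacher's grid. This is a minimal sketch, not the released implementation; the MSE objective, the `head` module, the resolutions, and all names are assumptions made for illustration.

```python
# Minimal sketch of the distillation idea described in the abstract.
# NOT the authors' implementation: the loss choice (plain MSE), the
# upsampling head, and all names/resolutions are illustrative assumptions.
import torch
import torch.nn.functional as F


def brixel_style_loss(teacher, student, head, image,
                      low_res=256, high_res=1024):
    """Match a low-resolution forward pass to high-resolution dense features.

    teacher: frozen backbone, run on the high-resolution image.
    student: trainable backbone, run on a downsampled copy of the image.
    head:    small module that maps student features toward the teacher's grid.
    Both backbones are assumed to return dense feature maps of shape (B, C, H, W).
    """
    # Target: dense features from the expensive high-resolution pass.
    with torch.no_grad():
        hi = F.interpolate(image, size=(high_res, high_res),
                           mode="bilinear", align_corners=False)
        target = teacher(hi)

    # Prediction: cheap low-resolution pass plus an upsampling head.
    lo = F.interpolate(image, size=(low_res, low_res),
                       mode="bilinear", align_corners=False)
    pred = head(student(lo))
    # Align spatial sizes before comparing feature maps.
    pred = F.interpolate(pred, size=target.shape[-2:],
                         mode="bilinear", align_corners=False)
    return F.mse_loss(pred, target)
```

Under this reading, the computational saving comes from running the quadratic-cost transformer only on the low-resolution input at inference time, with the head recovering the fine-grained feature map.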