Vision Transformers rely on a fixed grid of patch tokens that ignores the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content at pixel-level granularity while remaining backward-compatible with existing architectures, allowing pretrained models to be retrofitted. Our method performs hierarchical model selection guided by information criteria, achieving competitive performance on both image-level classification and dense-prediction tasks while also supporting out-of-the-box raster-to-vector conversion.
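To make the "hierarchical model selection with information criteria" idea concrete, the sketch below shows one common instantiation: deciding whether to merge two adjacent image regions by comparing the Bayesian Information Criterion (BIC) of a single shared constant-color model against two separate ones. The function names, the constant-color Gaussian fit, and the choice of BIC are illustrative assumptions; the abstract does not specify the paper's actual region model or criterion.

```python
import numpy as np

def bic(pixels: np.ndarray, n_params: int = 3) -> float:
    """BIC for a constant-color Gaussian fit to one region.

    pixels: (n, 3) array of RGB values belonging to the region.
    Lower is better; the log(n) term penalizes extra parameters.
    """
    n = len(pixels)
    # Residual sum of squares around the region's mean color.
    rss = float(((pixels - pixels.mean(axis=0)) ** 2).sum())
    # Gaussian log-likelihood up to constants: n * log(RSS / n).
    return n * np.log(rss / n + 1e-12) + n_params * np.log(n)

def should_merge(region_a: np.ndarray, region_b: np.ndarray) -> bool:
    """Merge two adjacent regions when one shared model explains both
    pixel sets better (lower total BIC) than two separate models."""
    merged = np.vstack([region_a, region_b])
    return bic(merged) < bic(region_a) + bic(region_b)

# Example: two near-uniform regions with the same mean color merge,
# since one shared model fits both at no cost in residual error.
a = np.random.default_rng(0).normal(0.5, 0.01, (200, 3))
b = np.random.default_rng(1).normal(0.5, 0.01, (200, 3))
print(should_merge(a, b))  # True
```

Applying such a test bottom-up over adjacent regions yields a hierarchy of progressively coarser tokens, with the information criterion trading off fit quality against model complexity at each merge.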