Vision Transformers rely on a fixed grid of patch tokens that ignores the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content at pixel-level granularity while remaining backward-compatible with existing architectures, allowing pretrained models to be retrofitted. Our method performs hierarchical model selection guided by information criteria, achieving competitive performance on both image-level classification and dense-prediction tasks while also supporting out-of-the-box raster-to-vector conversion.
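To make the "hierarchical model selection with information criteria" idea concrete, the sketch below shows one common instantiation: deciding whether to merge two adjacent image regions by comparing the Bayesian Information Criterion (BIC) of a single shared constant-color model against two separate ones. The function names, the constant-color Gaussian fit, and the choice of BIC are illustrative assumptions; the abstract does not specify the paper's actual region model or criterion.

```python
import numpy as np

def bic(pixels: np.ndarray, n_params: int = 3) -> float:
    """BIC for a constant-color Gaussian fit to one region.

    pixels: (n, 3) array of RGB values belonging to the region.
    Lower is better; the log(n) term penalizes extra parameters.
    """
    n = len(pixels)
    # Residual sum of squares around the region's mean color.
    rss = float(((pixels - pixels.mean(axis=0)) ** 2).sum())
    # Gaussian log-likelihood up to constants: n * log(RSS / n).
    return n * np.log(rss / n + 1e-12) + n_params * np.log(n)

def should_merge(region_a: np.ndarray, region_b: np.ndarray) -> bool:
    """Merge two adjacent regions when one shared model explains both
    pixel sets better (lower total BIC) than two separate models."""
    merged = np.vstack([region_a, region_b])
    return bic(merged) < bic(region_a) + bic(region_b)

# Example: two near-uniform regions with the same mean color merge,
# since one shared model fits both at no cost in residual error.
a = np.random.default_rng(0).normal(0.5, 0.01, (200, 3))
b = np.random.default_rng(1).normal(0.5, 0.01, (200, 3))
print(should_merge(a, b))  # True
```

Applying such a test bottom-up over adjacent regions yields a hierarchy of progressively coarser tokens, with the information criterion trading off fit quality against model complexity at each merge.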