Deeper Vision Transformers often perform worse than shallower ones, challenging common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. Better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify information mixing with an Information Scrambling Index and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B; these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
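For readers who want a concrete sense of how a depth-wise mixing diagnostic can be instrumented, the sketch below computes a simple entropy-based mixing profile from a model's per-layer attention maps. This is an illustrative proxy only: the paper's actual Information Scrambling Index is defined in the linked repository, and the helper names `attention_entropy` and `scrambling_profile`, along with the normalized-entropy formulation, are assumptions made here for clarity, not the authors' metric.

```python
# Minimal sketch (illustrative only): an entropy-based proxy for per-layer
# information mixing in a ViT. The paper's Information Scrambling Index is
# defined in the linked repository; the formulation below is an assumption
# chosen for readability, not the authors' metric.
import torch


def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the attention distributions.

    attn: (batch, heads, tokens, tokens), rows sum to 1 (post-softmax).
    High entropy means each token attends broadly, i.e. heavy mixing.
    """
    eps = 1e-9
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, heads, tokens)
    return ent.mean()


def scrambling_profile(attn_maps: list[torch.Tensor]) -> list[float]:
    """Normalized mixing score per layer, each in [0, 1].

    attn_maps: one post-softmax attention tensor per transformer block.
    Each layer's entropy is divided by log(num_tokens), the entropy of a
    uniform attention row, so scores are comparable across model sizes.
    """
    scores = []
    for attn in attn_maps:
        n_tokens = attn.shape[-1]
        max_ent = torch.log(torch.tensor(float(n_tokens)))
        scores.append((attention_entropy(attn) / max_ent).item())
    return scores


if __name__ == "__main__":
    # Fake attention maps for a 12-block model with 197 tokens
    # (ViT-B/16 at 224px resolution: 196 patches + 1 [CLS] token).
    torch.manual_seed(0)
    maps = [torch.softmax(torch.randn(1, 12, 197, 197), dim=-1)
            for _ in range(12)]
    # A Cliff-Plateau-Climb pattern would appear as distinct regimes when
    # this profile is plotted against layer depth for a trained model.
    for i, s in enumerate(scrambling_profile(maps)):
        print(f"layer {i:2d}: mixing = {s:.3f}")
```

In practice the attention maps would come from forward hooks on a trained ViT rather than random tensors; the random input here only demonstrates the plumbing.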