ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. LC is calculated by measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs contribute minimally when processing visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing the computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
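
To make the LC idea concrete, the following is a minimal sketch on a toy residual network, not the ShortV implementation. Several things here are assumptions made purely for illustration: the abstract only says "divergence", so KL divergence is one possible choice; `toy_layer`, `forward`, `layer_contribution`, the mean-pooled context that stands in for attention, and all sizes are hypothetical. The sketch scores each layer by how much the output distribution changes when that layer leaves the visual tokens untouched, then freezes visual tokens in the roughly 60% of layers with the lowest scores.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_layer(h, W, frozen=None):
    """Toy residual block; rows where `frozen` is True keep their input state."""
    update = np.ones(len(h), dtype=bool) if frozen is None else ~frozen
    context = h.mean(axis=0, keepdims=True)  # crude stand-in for attention (assumption)
    out = h.copy()
    # Only non-frozen tokens are transformed, so frozen tokens cost no matmul here.
    out[update] = h[update] + np.tanh((h[update] + context) @ W)
    return out

def forward(h, weights, head, frozen_layers=(), token_mask=None):
    """Run all layers; in `frozen_layers`, tokens selected by `token_mask` are not updated."""
    for i, W in enumerate(weights):
        h = toy_layer(h, W, token_mask if i in frozen_layers else None)
    return softmax(h[-1] @ head)  # next-token distribution read from the last token

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def layer_contribution(h0, weights, head, token_mask):
    """Per-layer score: divergence between the full model's output and the output
    obtained when that single layer skips its transformation on the masked tokens."""
    base = forward(h0, weights, head)
    return [kl(base, forward(h0, weights, head, {i}, token_mask))
            for i in range(len(weights))]

# Hypothetical toy setup: 12 visual tokens followed by 4 text tokens.
rng = np.random.default_rng(0)
n_layers, n_visual, n_text, dim, vocab = 10, 12, 4, 8, 16
weights = [rng.normal(scale=0.3, size=(dim, dim)) for _ in range(n_layers)]
head = rng.normal(size=(dim, vocab))
h0 = rng.normal(size=(n_visual + n_text, dim))
visual_mask = np.arange(n_visual + n_text) < n_visual

lc_visual = layer_contribution(h0, weights, head, visual_mask)
n_frozen = round(0.6 * n_layers)  # "approximately 60%" of the layers
ineffective = sorted(np.argsort(lc_visual)[:n_frozen].tolist())
shortv_out = forward(h0, weights, head, set(ineffective), visual_mask)

print("layers with frozen visual tokens:", ineffective)
print("divergence from the full model:", kl(forward(h0, weights, head), shortv_out))
```

Passing a mask that selects the text tokens instead of `visual_mask` gives the corresponding text-token score, which is how the two kinds of contribution could be compared layer by layer in this toy setting.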
