Despite the remarkable success of the LLaVA architecture on vision-language tasks, its design struggles to integrate visual features effectively due to the inherent mismatch between the text and vision modalities. We tackle this issue from a novel perspective in which the LLM serves not only as a language model but also as a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for the vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
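To make modifications (1) and (2) concrete, the sketch below shows how a decoder self-attention layer could apply separate QKV projections to visual tokens and let them attend to one another bidirectionally while text tokens remain causal. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and argument names (ModalitySplitAttention, vision_mask) are hypothetical, and modification (3), the fusion of global and local visual representations, is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySplitAttention(nn.Module):
    """Self-attention with separate QKV projections for visual tokens (1)
    and bidirectional attention among them (2); text tokens stay causal.
    Hypothetical sketch, not the LLaViT reference code."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Original text projections, as in the base LLM.
        self.qkv_text = nn.Linear(dim, 3 * dim, bias=False)
        # Modification (1): separate QKV projections learned for the vision modality.
        self.qkv_vision = nn.Linear(dim, 3 * dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, vision_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); vision_mask: (batch, seq_len) bool, True at visual tokens.
        b, s, d = x.shape
        # Route each token through the projection of its own modality.
        qkv = torch.where(vision_mask.unsqueeze(-1),
                          self.qkv_vision(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))

        # Modification (2): causal mask everywhere, except pairs of visual
        # tokens may attend to each other bidirectionally.
        causal = torch.ones(s, s, device=x.device).tril().bool()        # (s, s)
        vis_pair = vision_mask.unsqueeze(2) & vision_mask.unsqueeze(1)  # (b, s, s)
        attn_mask = (causal.unsqueeze(0) | vis_pair).unsqueeze(1)       # (b, 1, s, s)

        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
        out = out.transpose(1, 2).reshape(b, s, d)
        return self.out_proj(out)
```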