Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, reflecting each layer's capacity to handle additional information, to produce question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory than the second-best MLLM.
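To make the two feature-side ideas concrete, the sketch below illustrates one possible realization in PyTorch: a cross-attention module that injects question tokens into selected ViT layers, following a schedule whose injection frequency grows with depth, plus a simple top-k pruning step that keeps only the highest-scoring visual tokens. The injection mechanism, the specific schedule, and the token-scoring function are illustrative assumptions for exposition, not TabFlash's actual implementation.

```python
# Minimal sketch of progressive question conditioning and background-token pruning.
# Assumptions (not from the paper): question injection via cross-attention,
# a hand-crafted depth-dependent injection schedule, and top-k score-based pruning.
import torch
import torch.nn as nn


class QuestionInjection(nn.Module):
    """Cross-attention from visual tokens to question tokens (assumed design)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # Residual update: visual tokens attend to the question embedding.
        out, _ = self.attn(query=self.norm(visual), key=question, value=question)
        return visual + out


def injection_schedule(num_layers: int) -> list[bool]:
    """Toy schedule with gradually increasing injection frequency:
    no injection in the first third of layers, every other layer in the
    middle third, and every layer in the last third (illustrative only)."""
    flags = []
    for i in range(num_layers):
        frac = i / max(num_layers - 1, 1)
        if frac < 1 / 3:
            flags.append(False)
        elif frac < 2 / 3:
            flags.append(i % 2 == 0)
        else:
            flags.append(True)
    return flags


class QuestionConditionedViT(nn.Module):
    """Plain Transformer encoder standing in for a ViT backbone."""

    def __init__(self, dim: int = 768, num_layers: int = 12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )
        self.inject = nn.ModuleList(QuestionInjection(dim) for _ in range(num_layers))
        self.schedule = injection_schedule(num_layers)

    def forward(self, visual: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        for layer, inject, do_inject in zip(self.layers, self.inject, self.schedule):
            if do_inject:
                visual = inject(visual, question)
            visual = layer(visual)
        return visual


def prune_background_tokens(
    visual: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5
) -> torch.Tensor:
    """Keep the highest-scoring visual tokens and discard the rest.
    `scores` (batch, num_tokens) is a hypothetical relevance score, e.g.
    attention mass toward the question; its exact form is not shown here."""
    b, n, d = visual.shape
    k = max(1, int(n * keep_ratio))
    idx = scores.topk(k, dim=1).indices  # (b, k) indices of retained tokens
    return visual.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
```

As a usage note, retaining only a fixed fraction of tokens after the encoder is what yields the compute and memory savings on the LLM side; the token-focusing training objective described above would then encourage the retained tokens to carry the information the pruned ones would otherwise have held.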