We introduce TimeViper, a hybrid vision-language model designed to tackle the challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal a vision-to-text information aggregation phenomenon, in which information progressively flows from vision tokens to text tokens with increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper remains competitive with state-of-the-art models while processing substantially more input frames. We further analyze the attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
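To make the transfer-and-compress idea behind TransV concrete, the following is a minimal sketch in PyTorch. It assumes the transfer step can be realized as cross-attention from instruction tokens to vision tokens followed by dropping most vision tokens; the class name, interface, keep ratio, and uniform temporal selection are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class TransVSketch(nn.Module):
    """Illustrative token-transfer module (not the official TransV implementation).

    Assumption: instruction (text) tokens absorb vision-token information via
    cross-attention, after which most vision tokens can be discarded.
    """

    def __init__(self, dim: int = 1024, num_heads: int = 8, keep_ratio: float = 0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.keep_ratio = keep_ratio  # fraction of vision tokens retained after compression

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # Transfer: instruction tokens query the vision tokens and absorb their content.
        transferred, _ = self.cross_attn(
            query=text_tokens, key=vision_tokens, value=vision_tokens
        )
        text_tokens = self.norm(text_tokens + transferred)

        # Compress: keep only a small subset of vision tokens (uniform temporal
        # stride here; the paper's selection criterion may differ).
        num_keep = max(1, int(vision_tokens.size(1) * self.keep_ratio))
        idx = torch.linspace(0, vision_tokens.size(1) - 1, num_keep).long()
        vision_tokens = vision_tokens[:, idx]
        return vision_tokens, text_tokens


if __name__ == "__main__":
    module = TransVSketch(dim=1024)
    vision = torch.randn(1, 8192, 1024)  # e.g. tokens pooled from thousands of frames
    text = torch.randn(1, 64, 1024)      # instruction tokens
    v_out, t_out = module(vision, text)
    print(v_out.shape, t_out.shape)      # [1, 819, 1024], [1, 64, 1024]
```

Under this reading, the sequence length fed to later LLM layers shrinks by roughly the keep ratio, which is what would allow frame counts on the order of 10,000 without a proportional growth in context length.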