Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of a limited number of frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. In addition, it is equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in a single processing step, avoiding redundant computation. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show obvious performance gains by OneClip-RAG over the base MLLMs, e.g., boosting InternVL2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also demonstrate its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
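The abstract does not detail the query-guided chunking algorithm; the sketch below is only an illustration, under assumed CLIP-style frame/query embeddings and a hypothetical boundary threshold, of how a single frame-query similarity pass could serve both clip chunking and retrieval, which is the general idea the abstract alludes to rather than the paper's actual method.

```python
# Illustrative sketch only (not the paper's implementation): compute frame-query
# similarities once, then reuse them to (a) cut clip boundaries where relevance
# changes and (b) rank the resulting clips, avoiding a second encoding pass.
import numpy as np

def query_guided_chunk_and_retrieve(frame_emb, query_emb, boundary_thresh=0.15, top_k=3):
    """frame_emb: (T, D) L2-normalized frame embeddings (assumed CLIP-style).
    query_emb: (D,) L2-normalized query embedding.
    Returns top_k clips as (start, end) frame-index pairs, ranked by relevance."""
    # One similarity pass, reused for both chunking and retrieval.
    sim = frame_emb @ query_emb                                # (T,)

    # Chunk: cut wherever the similarity profile jumps, so each clip stays
    # semantically coherent with respect to the query.
    cuts = np.where(np.abs(np.diff(sim)) > boundary_thresh)[0] + 1
    bounds = np.concatenate(([0], cuts, [len(sim)]))
    clips = [(int(s), int(e)) for s, e in zip(bounds[:-1], bounds[1:])]

    # Retrieve: score each clip with the same similarities (no re-encoding).
    scores = [sim[s:e].mean() for s, e in clips]
    order = np.argsort(scores)[::-1][:top_k]
    return [clips[i] for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for embeddings of 64 frames and one text query.
    frames = rng.normal(size=(64, 512))
    frames /= np.linalg.norm(frames, axis=1, keepdims=True)
    query = rng.normal(size=512)
    query /= np.linalg.norm(query)
    print(query_guided_chunk_and_retrieve(frames, query))
```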