Video Large Language Models have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to effectively exploit the spatiotemporal redundancy present in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs, termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential spatial and temporal information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision, LLaVA-Video, Qwen2-VL, and Qwen2.5-VL. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes $\textbf{90.3\%}$ of video tokens, reduces FLOPs to $\textbf{8.3\%}$, and accelerates the LLM prefill stage by $\textbf{7.1}\times$, while maintaining $\textbf{98.0\%}$ of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.
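To make the two stages concrete, the sketch below illustrates the general idea of segment-then-prune on video token features: frames are split into temporally ordered segments wherever adjacent frames become dissimilar, and within each segment only the highest-density tokens are kept. The similarity threshold, the density score (mean pairwise cosine similarity), and the keep ratio are illustrative assumptions, not FastVID's exact formulation; see the repository linked above for the actual method.

```python
import torch
import torch.nn.functional as F

def partition_segments(frame_feats, sim_threshold=0.85):
    """Split frames into temporally ordered segments.

    A new segment starts whenever the cosine similarity between the mean
    features of adjacent frames drops below `sim_threshold` (a hypothetical
    proxy for a content change, not the paper's exact criterion).
    frame_feats: (T, N, D) tensor of per-frame token features.
    Returns a list of (start, end) frame-index pairs.
    """
    frame_means = F.normalize(frame_feats.mean(dim=1), dim=-1)   # (T, D)
    sims = (frame_means[1:] * frame_means[:-1]).sum(dim=-1)      # (T-1,)
    boundaries = [0] + [t + 1 for t, s in enumerate(sims.tolist()) if s < sim_threshold]
    boundaries.append(frame_feats.shape[0])
    return list(zip(boundaries[:-1], boundaries[1:]))

def density_prune(segment_feats, keep_ratio=0.1):
    """Keep the highest-density tokens inside one temporal segment.

    Density is approximated as each token's mean cosine similarity to all
    other tokens in the segment, so the retained tokens act as compact
    representatives; the actual FastVID scoring may differ.
    segment_feats: (M, D) tensor of tokens from one segment.
    """
    normed = F.normalize(segment_feats, dim=-1)
    density = (normed @ normed.T).mean(dim=-1)                   # (M,)
    num_keep = max(1, int(keep_ratio * segment_feats.shape[0]))
    keep_idx = density.topk(num_keep).indices.sort().values      # preserve token order
    return segment_feats[keep_idx]

# Toy usage: 32 frames, 196 tokens per frame, 1152-dim features.
frames = torch.randn(32, 196, 1152)
pruned = [density_prune(frames[s:e].flatten(0, 1), keep_ratio=0.097)
          for s, e in partition_segments(frames)]
video_tokens = torch.cat(pruned, dim=0)   # roughly 90% of video tokens removed
print(video_tokens.shape)
```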