The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%. Code and models will be released publicly.
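The fused scoring step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the fusion rule (a convex EMA combination of the student's running difficulty and the teacher's per-token loss), the `alpha` smoothing factor, and the `keep_ratio` are all hypothetical choices; the abstract's "38% of the tokens" motivates the default keep fraction.

```python
def select_tokens(teacher_loss, ema_difficulty, alpha=0.9, keep_ratio=0.38):
    """Hypothetical sketch of reference-guided token selection.

    teacher_loss: per-token guidance losses from the frozen teacher LLM.
    ema_difficulty: running EMA of the student's per-token difficulty.
    Returns the sorted indices of tokens kept for the forward pass,
    plus the updated EMA (assumed fusion rule, not the paper's exact one).
    """
    # Fuse teacher guidance into the student's difficulty estimate (EMA update).
    ema_new = [alpha * e + (1.0 - alpha) * t
               for e, t in zip(ema_difficulty, teacher_loss)]
    # Keep the highest-scoring fraction of tokens; elide the rest.
    k = max(1, int(keep_ratio * len(teacher_loss)))
    ranked = sorted(range(len(ema_new)), key=lambda i: ema_new[i], reverse=True)
    keep_idx = sorted(ranked[:k])
    return keep_idx, ema_new
```

Only the kept indices would be forwarded through the student, which is where the compute savings come from; the model architecture itself is untouched.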