Recent advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, because such pretraining is large-scale and therefore expensive, efficiency concerns have discouraged researchers from attempting to pretrain a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact to scientific domains.
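To make the core idea concrete, the following is a minimal sketch of input-dependent token merging in a ViT-style encoder. It is not the authors' DRIP implementation; the module, its learned scoring head, and the `keep_ratio` parameter are hypothetical illustrations of how deeper layers could operate on fewer, dynamically pooled patch tokens.

```python
# Minimal sketch (not the DRIP implementation): a layer that scores patch
# tokens per image, keeps the most important ones, and merges the rest into
# a single pooled token so deeper layers process fewer tokens.
import torch
import torch.nn as nn


class DynamicTokenPool(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)  # learned importance score per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), e.g. ViT patch embeddings
        b, n, d = x.shape
        k = max(1, int(n * self.keep_ratio))

        scores = self.score(x).squeeze(-1)         # (b, n)
        keep_idx = scores.topk(k, dim=1).indices   # (b, k) most important
        keep = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

        # Merge the remaining tokens into one pooled token, weighted by
        # their softmaxed scores, instead of discarding them outright.
        mask = torch.ones(b, n, dtype=torch.bool, device=x.device)
        mask.scatter_(1, keep_idx, False)
        rest = x[mask].view(b, n - k, d)
        rest_scores = scores[mask].view(b, n - k)
        weights = rest_scores.softmax(dim=1).unsqueeze(-1)   # (b, n-k, 1)
        pooled = (weights * rest).sum(dim=1, keepdim=True)   # (b, 1, d)

        return torch.cat([keep, pooled], dim=1)              # (b, k+1, d)


if __name__ == "__main__":
    layer = DynamicTokenPool(dim=768, keep_ratio=0.5)
    tokens = torch.randn(2, 196, 768)   # 14x14 patches from a 224px image
    print(layer(tokens).shape)          # torch.Size([2, 99, 768])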