While transformer-based architectures have taken computer vision and NLP by storm, they typically require vast numbers of parameters and large amounts of training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits for a small 5M-parameter vision transformer. We find that while pre-training and fine-tuning consistently help our model, they exhibit diminishing returns; intermediate fine-tuning, by contrast, can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.