There is growing evidence that pretrained language models improve task-specific fine-tuning not just for the languages seen in pretraining, but also for new languages and even non-linguistic data. What is the nature of this surprising cross-domain transfer? We offer a partial answer via a systematic exploration of how much transfer occurs when models are denied any information about word identity via random scrambling. In four classification tasks and two sequence labeling tasks, we evaluate baseline models, LSTMs using GloVe embeddings, and BERT. We find that only BERT shows high rates of transfer into our scrambled domains, and only for classification tasks, not sequence labeling tasks. Our analyses seek to explain why transfer succeeds for some tasks but not others, to isolate the separate contributions of pretraining versus fine-tuning, and to quantify the role of word frequency. These findings help explain where and why cross-domain transfer occurs, which can guide future studies and practical fine-tuning efforts.
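The abstract does not spell out the scrambling procedure, so as a rough illustration only, the sketch below shows one plausible way to deny a model information about word identity: apply a single corpus-wide random permutation of word types, which breaks any link between surface forms and pretraining knowledge while preserving type-level frequency and positional statistics. The function name `scramble_vocabulary`, the toy corpus, and the permutation scheme itself are assumptions for illustration, not the authors' code.

```python
import random

def scramble_vocabulary(sentences, seed=0):
    """Remap every word type to a randomly chosen replacement type.

    The same mapping is used for every occurrence, so corpus-level
    statistics (word frequencies, sentence lengths, token positions)
    are preserved while word identities become uninformative.
    """
    rng = random.Random(seed)
    vocab = sorted({tok for sent in sentences for tok in sent})
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(vocab, shuffled))  # deterministic type-level permutation
    return [[mapping[tok] for tok in sent] for sent in sentences]

# Example: the permutation is applied consistently across the corpus,
# so repeated words stay repeated even though their identities change.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(scramble_vocabulary(corpus))
```

Because frequency information survives this kind of transformation, a setup like this would let one ask (as the abstract does) how much of the observed transfer can be attributed to word frequency rather than to word identity.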