Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, has occurred without a rigorous comparative analysis, especially \textit{from the scaling perspective}, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit the encoder-decoder LLM (RedLLM), enhancing it with recent recipes from the decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, across model scales ranging from $\sim$150M to $\sim$8B parameters. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM exhibits compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable or even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings will inspire further efforts to re-examine RedLLM, unlocking its potential for developing powerful and efficient LLMs.
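For readers less familiar with the two pretraining objectives contrasted above, the following sketch of the standard formulations may help (the notation here is ours, not taken from the paper): causal LM predicts every token from its left context with a strictly autoregressive model, whereas prefix LM splits each sequence into a prefix that may be attended to bidirectionally and a suffix that is decoded autoregressively.
\[
\mathcal{L}_{\text{causal}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}_{\text{prefix}}(\theta) = -\sum_{t=k+1}^{T} \log p_\theta\!\left(x_t \mid x_{\le k},\, x_{k+1:t-1}\right),
\]
where $k$ denotes the prefix length. Under the usual encoder-decoder instantiation of prefix LM, the prefix $x_{\le k}$ is processed by the encoder with full (non-causal) attention and the suffix is generated by the decoder, while causal LM processes the entire sequence with a causal mask.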