Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model

Recent large language model (LLM) research has shifted architecturally from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, has come without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit the encoder-decoder LLM (RedLLM), enhancing it with recent recipes from the decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at model scales ranging from ~150M to ~8B parameters. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM exhibits compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings inspire more efforts to re-examine RedLLM and unlock its potential for developing powerful and efficient LLMs.
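To make the two pretraining objectives named in the abstract concrete, here is a minimal sketch of how one token sequence could be turned into a training example under each. It is an illustration only, not the authors' data pipeline: the function names, the `BOS_ID` placeholder, and the choice of split point are all assumptions, since the abstract does not specify them.

```python
from typing import List, Tuple

BOS_ID = 0  # hypothetical start-of-sequence id, chosen only for this sketch


def causal_lm_example(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Causal LM (decoder-only): every token is predicted from the tokens before it."""
    inputs = tokens[:-1]   # what the decoder reads
    targets = tokens[1:]   # what it must predict, shifted by one position
    return inputs, targets


def prefix_lm_example(
    tokens: List[int], split: int
) -> Tuple[List[int], List[int], List[int]]:
    """Prefix LM (encoder-decoder): the prefix is encoded, only the rest is predicted."""
    if not 0 < split < len(tokens):
        raise ValueError("split must leave a non-empty prefix and target")
    prefix = tokens[:split]                  # encoder input, no loss on these tokens
    targets = tokens[split:]                 # decoder targets
    decoder_inputs = [BOS_ID] + targets[:-1]  # targets shifted right by one
    return prefix, decoder_inputs, targets


if __name__ == "__main__":
    sequence = [5, 6, 7, 8, 9]
    print(causal_lm_example(sequence))
    print(prefix_lm_example(sequence, split=2))
```

In this sketch the causal objective supervises every position after the first, while the prefix objective supervises only the tokens after the split, which is one reason the two setups can differ in how much training signal they draw from the same number of tokens.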

