On the Expressivity Role of LayerNorm in Transformers' Attention - 专知论文

会员服务 ·

0

Attention · 向量化 · 层 · 规范化的 · Projection ·

2023 年 5 月 4 日

On the Expressivity Role of LayerNorm in Transformers' Attention

翻译：暂无翻译

Shaked Brody,Uri Alon,Eran Yahav

from arxiv, Accepted as a short paper in Findings of ACL 2023

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors to a $d-1$ space that is orthogonal to the $\left[1,1,...,1\right]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, offloading the need to learn this operation by the attention; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from being "un-select-able". We show empirically that Transformers do indeed benefit from these properties of LayeNorm in general language modeling and even in computing simple functions such as "majority". Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role .

翻译：暂无翻译

1

相关内容

Attention

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

应变调控VO2相变特性与光（电）子器件应用基础

国家自然科学基金

0+阅读 · 2014年12月31日

低钾胁迫下玉米自交系叶片抗氧化防护机理研究

国家自然科学基金

0+阅读 · 2013年12月31日

高效ⅤB /ⅡB族复合光催化剂分级结构的构筑及光生载流子传输机制

国家自然科学基金

0+阅读 · 2012年12月31日

Si基子带跃迁中红外探测器研究

国家自然科学基金

0+阅读 · 2011年12月31日

高效全固态染料敏化太阳能电池纳米复合聚合物电解质的研究

国家自然科学基金

0+阅读 · 2009年12月31日

Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices

Arxiv

0+阅读 · 2023年6月20日

On the Robustness of Dataset Inference

Arxiv

0+阅读 · 2023年6月19日

Seq-HyGAN: Sequence Classification via Hypergraph Attention Network

Arxiv

0+阅读 · 2023年6月15日

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Arxiv

18+阅读 · 2019年9月25日

DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding

Arxiv

16+阅读 · 2017年11月20日

VIP会员

文章信息

相关主题

相关VIP内容

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

热门VIP内容

开通专知VIP会员享更多权益服务

【MIT博士论文】弱监督学习：理论、方法与应用

Andrej Karpathy：2025 年 LLM 年度回顾（2025 LLM Year in Review）

锚定情报：合成欺骗时代的地面真相

NeurIPS 2025 | NMKE：基于神经元归因与动态稀疏掩码的终身知识编辑

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices

Arxiv

0+阅读 · 2023年6月20日

On the Robustness of Dataset Inference

Arxiv

0+阅读 · 2023年6月19日

Seq-HyGAN: Sequence Classification via Hypergraph Attention Network

Arxiv

0+阅读 · 2023年6月15日

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Extreme Language Model Compression with Optimal Subwords and Shared Projections

Arxiv

18+阅读 · 2019年9月25日

DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding

Arxiv

16+阅读 · 2017年11月20日

相关基金

应变调控VO2相变特性与光（电）子器件应用基础

国家自然科学基金

0+阅读 · 2014年12月31日

低钾胁迫下玉米自交系叶片抗氧化防护机理研究

国家自然科学基金

0+阅读 · 2013年12月31日

高效ⅤB /ⅡB族复合光催化剂分级结构的构筑及光生载流子传输机制

国家自然科学基金

0+阅读 · 2012年12月31日

Si基子带跃迁中红外探测器研究

国家自然科学基金

0+阅读 · 2011年12月31日

高效全固态染料敏化太阳能电池纳米复合聚合物电解质的研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员