Deep sequence models tend to memorize geometrically; it is unclear why

In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimization pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, indeed arises naturally despite the lack of various pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
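As a rough illustration of how purely local associations can induce a global geometry through spectral structure, the sketch below embeds a toy path graph using a low-frequency eigenvector of its graph Laplacian. Everything here is an assumption made for illustration: the path graph, the helper names `path_graph_adjacency` and `spectral_embedding`, and the use of a Laplacian eigendecomposition as a stand-in for a spectral bias. It does not train Node2Vec or a Transformer, and it is not the experimental setup of the paper.

```python
import numpy as np


def path_graph_adjacency(n):
    """Adjacency matrix of a path 0 - 1 - ... - (n-1).

    Only neighbouring nodes are linked: these links play the role of the
    local co-occurrences seen during training (illustrative assumption).
    """
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A


def spectral_embedding(A, dim):
    """Embed nodes with the `dim` lowest non-trivial Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A          # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return eigvecs[:, 1:dim + 1]            # skip the constant eigenvector


n = 16
A = path_graph_adjacency(n)
emb = spectral_embedding(A, dim=1)[:, 0]

# Eigenvectors are defined up to sign; orient so node 0 has the smallest value.
if emb[0] > emb[-1]:
    emb = -emb

# Nodes 0 and n-1 are never adjacent, yet their embedding gap is the largest.
endpoint_gap = abs(emb[-1] - emb[0])
neighbour_gap = abs(emb[1] - emb[0])
print("embedding is ordered along the path:", bool(np.all(np.diff(emb) > 0)))
print("endpoint gap exceeds neighbour gap:", bool(endpoint_gap > neighbour_gap))
```

In this toy case the one-dimensional embedding orders the nodes along the path, so the relative position of two nodes that never co-occur can be read off with a single comparison of coordinates instead of a multi-step traversal of the edge list.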

