In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view with a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a store of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn simplifies a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the emergence of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimization pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, arises naturally despite the lack of various pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.