This research aims to unlock further potential of transformer-based architectures. A primary motivation is to provide a geometric interpretation of the attention mechanism in transformers. In our framework, attention is formulated in terms of metric tensors, tangent spaces, inner products, and the relationships among them. These quantities and structures at discrete positions are interconnected via the parallel transport of tangent vectors. To make learning more efficient, we reduce the number of parameters through carefully designed predefined configurations. Moreover, since transformers inherently lack a local inductive bias, we introduce an explicit mechanism that highlights a neighborhood by attenuating remote values. Experimental results show that our modules deliver significant performance improvements over the baseline. Further evaluations on vision models and large language models will follow.
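As a rough illustration of the locality mechanism mentioned above, the sketch below applies a distance-based penalty to attention logits so that remote values are attenuated relative to a position's neighborhood. This is a minimal sketch under assumed details: the function name `local_attention`, the linear `decay` penalty, and all shapes are hypothetical and not taken from the paper, which may realize the attenuation differently.

```python
import numpy as np

def local_attention(q, k, v, decay=0.5):
    """Scaled dot-product attention with a hypothetical distance-based
    attenuation that down-weights remote positions (illustrative only)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (n, n) attention logits
    pos = np.arange(n)
    dist = np.abs(pos[:, None] - pos[None, :])       # pairwise position distances
    scores = scores - decay * dist                   # attenuate remote values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

# Example usage with random queries, keys, and values.
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(8, 16))
out = local_attention(q, k, v)                       # (8, 16) output
```

The additive penalty keeps the softmax normalization intact while biasing each position toward nearby values; other choices (e.g., a multiplicative window) would serve the same illustrative purpose.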