Transformer architectures can solve unseen tasks from input-output pairs supplied in a prompt, a capability known as in-context learning (ICL). Existing theoretical studies of ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup and characterize the loss landscape to reveal the underlying optimization behavior. Specifically, we (1) provide a closed-form expression for the global minimizer (in an enlarged parameter space) of a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize this optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) give a novel interpretation of multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
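To make the setup concrete, below is a minimal, illustrative sketch (not the paper's exact construction) of a single-layer linear self-attention head applied to an in-context prompt of (x_i, y_i) pairs plus a query token. The weight matrices, the toy i.i.d. linear-regression data, and the names (lsa_predict, W_kq, W_pv) are hypothetical placeholders; with the block-structured weights chosen here, one LSA step reduces to a single preconditioned gradient-descent step on the square loss, the kind of interpretation the paper extends to Markovian inputs and multilayer LSA.

```python
import numpy as np

def lsa_predict(Z, W_kq, W_pv, n):
    """One linear self-attention step: Z -> Z + W_pv @ Z @ (Z.T @ W_kq @ Z) / n."""
    attn = Z.T @ W_kq @ Z            # (n+1, n+1) linear attention scores (no softmax)
    out = Z + (W_pv @ Z @ attn) / n  # residual update of every token
    return out[-1, -1]               # label coordinate of the query token

# Toy usage: n in-context examples of a linear task y = w^T x, plus a query.
rng = np.random.default_rng(0)
d, n = 3, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = w_true @ X
x_query = rng.normal(size=d)

# Prompt matrix Z of shape (d+1, n+1): each column is a token (x_i, y_i);
# the query column carries (x_query, 0), with its label slot left empty.
Z = np.zeros((d + 1, n + 1))
Z[:d, :n], Z[d, :n], Z[:d, n] = X, y, x_query

# Hypothetical block-structured weights: key/query act on the x-block only,
# value/projection writes into the label row.
W_kq = np.zeros((d + 1, d + 1))
W_kq[:d, :d] = np.eye(d)
W_pv = np.zeros((d + 1, d + 1))
W_pv[d, d] = 1.0

pred = lsa_predict(Z, W_kq, W_pv, n)  # equals (1/n) * sum_i y_i * x_i^T x_query
print(pred, w_true @ x_query)
```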