Transformer architectures can solve unseen tasks from input-output pairs supplied in a prompt, a capability known as in-context learning (ICL). Existing theoretical studies of ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup and characterize the loss landscape to reveal the underlying optimization behavior. Specifically, we (1) provide a closed-form expression for the global minimizer (in an enlarged parameter space) of a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize this optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) give a novel interpretation of multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
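To make the setup concrete, below is a minimal, illustrative sketch (not the paper's exact construction) of a single-layer linear self-attention head applied to an in-context prompt of (x_i, y_i) pairs plus a query token. The weight matrices, the toy i.i.d. linear-regression data, and the names (lsa_predict, W_kq, W_pv) are hypothetical placeholders; with the block-structured weights chosen here, one LSA step reduces to a single preconditioned gradient-descent step on the square loss, the kind of interpretation the paper extends to Markovian inputs and multilayer LSA.

```python
import numpy as np

def lsa_predict(Z, W_kq, W_pv, n):
    """One linear self-attention step: Z -> Z + W_pv @ Z @ (Z.T @ W_kq @ Z) / n."""
    attn = Z.T @ W_kq @ Z            # (n+1, n+1) linear attention scores (no softmax)
    out = Z + (W_pv @ Z @ attn) / n  # residual update of every token
    return out[-1, -1]               # label coordinate of the query token

# Toy usage: n in-context examples of a linear task y = w^T x, plus a query.
rng = np.random.default_rng(0)
d, n = 3, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = w_true @ X
x_query = rng.normal(size=d)

# Prompt matrix Z of shape (d+1, n+1): each column is a token (x_i, y_i);
# the query column carries (x_query, 0), with its label slot left empty.
Z = np.zeros((d + 1, n + 1))
Z[:d, :n], Z[d, :n], Z[:d, n] = X, y, x_query

# Hypothetical block-structured weights: key/query act on the x-block only,
# value/projection writes into the label row.
W_kq = np.zeros((d + 1, d + 1))
W_kq[:d, :d] = np.eye(d)
W_pv = np.zeros((d + 1, d + 1))
W_pv[d, d] = 1.0

pred = lsa_predict(Z, W_kq, W_pv, n)  # equals (1/n) * sum_i y_i * x_i^T x_query
print(pred, w_true @ x_query)
```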