The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. To address this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal 'change hypothesis' that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty-driven to prediction-driven gating over training, suggesting alignment with surprise-based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute-skipping layer, charting a path toward more efficient models.
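To make the routing idea concrete, below is a minimal sketch, assuming a PyTorch-style implementation: a Dynamic layer that sends only the Top-K most "surprising" tokens through a full Transformer block while the remaining tokens take a cheap linear 'prior' path. The surprise proxy, the `capacity_ratio` parameter, the use of `nn.TransformerEncoderLayer`, and the fusion of prior and routing into one module are illustrative assumptions, not the paper's exact SDT design.

```python
import torch
import torch.nn as nn


class SurpriseTopKDynamicLayer(nn.Module):
    """Sketch of fixed-capacity, surprise-based Top-K routing (illustrative only)."""

    def __init__(self, d_model: int, n_heads: int, capacity_ratio: float = 0.25):
        super().__init__()
        # Heavy path: a full Transformer block, applied only to routed tokens.
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Lightweight 'prior': a cheap predictor of the block's residual update.
        self.prior = nn.Linear(d_model, d_model)
        self.capacity_ratio = capacity_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        k = max(1, int(self.capacity_ratio * T))  # fixed capacity -> static shapes

        # Surprise proxy: magnitude of the predicted residual update, a crude
        # stand-in for the Bayesian Expected/Unexpected Change signals.
        prior_update = self.prior(x)
        surprise = prior_update.pow(2).mean(dim=-1)  # (B, T)

        # Fixed-capacity Top-K routing: exactly k tokens per sequence take the heavy path.
        _, topk_idx = surprise.topk(k, dim=-1)                 # (B, k)
        gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, D)  # (B, k, D)
        selected = torch.gather(x, 1, gather_idx)

        # Note: for brevity the gathered tokens attend without a causal mask;
        # a decoder implementation would need to preserve causality.
        processed = self.block(selected)

        # Cheap path for every token, then overwrite the routed tokens with the
        # full block's output.
        out = x + prior_update
        out = out.scatter(1, gather_idx, processed)
        return out
```

With `capacity_ratio = 0.25`, only a quarter of the tokens attend inside the heavy block, which mirrors the 75% reduction in self-attention computation reported for each compute-skipping layer; keeping k fixed is what keeps tensor shapes, and hence the compute graph, static.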