Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To address this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks to capture long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies and enable massive context parallelization. Stage two is a brief fine-tuning phase in which only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration, while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.
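To make the two-level chunking scheme concrete, the following is a minimal, illustrative sketch of the update pattern described above: a global memory refreshed once per large, hardware-friendly chunk, and a local memory that is reset at every global-chunk boundary so that local passes over different global chunks are independent and can be parallelized. All names (hierarchical_forward, global_chunk, local_chunk) are hypothetical, the toy running-mean updates stand in for the learned deep-memory (test-time gradient) updates used in Titans and TTT, and the single local memory simplifies the multiple parallel local modules of the actual design; this is not the authors' implementation.

import numpy as np

def hierarchical_forward(tokens, global_chunk=512, local_chunk=64):
    """Two-level chunked update: one global memory refreshed per large chunk,
    and a local memory reset at every global-chunk boundary, so the local
    passes over different global chunks are independent of one another."""
    T, d_model = tokens.shape
    global_state = np.zeros(d_model)      # coarse, long-range context
    outputs = []

    for g_start in range(0, T, global_chunk):
        g_chunk = tokens[g_start:g_start + global_chunk]

        # Resetting the local memory here breaks the sequential dependency
        # between global chunks, which is what permits context parallelism.
        local_state = np.zeros(d_model)
        for l_start in range(0, len(g_chunk), local_chunk):
            l_chunk = g_chunk[l_start:l_start + local_chunk]
            # Toy running-mean update standing in for a learned deep-memory
            # (test-time gradient) update.
            local_state = 0.9 * local_state + 0.1 * l_chunk.mean(axis=0)
            outputs.append(l_chunk + local_state + global_state)

        # The global memory sees the whole large, hardware-friendly chunk.
        global_state = 0.99 * global_state + 0.01 * g_chunk.mean(axis=0)

    return np.concatenate(outputs, axis=0)

x = np.random.default_rng(0).normal(size=(2048, 16))
print(hierarchical_forward(x).shape)   # (2048, 16)

In this sketch, the fine-tuning stage would correspond to shrinking local_chunk and adapting only the local update while the global update is left fixed.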