RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling

Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention creates a computational bottleneck in both training and inference. Recurrent models are highly efficient, but compressing the full sequence into a single fixed-size representation can cause memory degradation in long contexts and limits fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and the capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk to model local dependencies, and applies softmax-based attention across chunks to model long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed at a sequence length of 100K and a 9$\times$ improvement in generation speed at the 4K position, while maintaining performance similar to standard attention. We demonstrate this by training 1.3B-parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and achieves the best overall results. Code is available at https://github.com/CLAIRE-Labo/RAT.

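To make the chunking idea concrete, the following is a minimal NumPy sketch of one possible reading of "recurrence within each chunk, softmax attention across chunks". It is not the authors' implementation (see the repository linked above for that). The function name `rat_sketch`, the elementwise gate, the use of each chunk's final recurrent state as its summary, and the choice to let a token attend over previous chunk summaries plus its own running state are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def rat_sketch(q, k, v, gate, chunk_size):
    """Hypothetical chunked recurrence + cross-chunk attention.

    q, k, v: arrays of shape (T, d).
    gate:    array of shape (T, d) with values in [0, 1); assumed forget gate.
    Returns an array of shape (T, d).
    """
    T, d = q.shape
    out = np.zeros_like(v)
    summaries_k, summaries_v = [], []  # final states of completed chunks
    for start in range(0, T, chunk_size):
        # The recurrent state is reset at every chunk boundary (assumption).
        hk = np.zeros(d)
        hv = np.zeros(d)
        for t in range(start, min(start + chunk_size, T)):
            g = gate[t]
            # Gated recurrence inside the chunk (local dependencies).
            hk = g * hk + (1.0 - g) * k[t]
            hv = g * hv + (1.0 - g) * v[t]
            # Softmax attention over earlier chunk summaries and the
            # current running state (long-range interactions).
            keys = np.stack(summaries_k + [hk])
            vals = np.stack(summaries_v + [hv])
            w = softmax(keys @ q[t] / np.sqrt(d))
            out[t] = w @ vals
        summaries_k.append(hk)
        summaries_v.append(hv)
    return out
```

In this sketch a token at position t attends over roughly t / chunk_size + 1 entries rather than t + 1, which is where the savings would come from. Two limiting cases serve as a sanity check: with `chunk_size=1` and a zero gate the sketch reduces to ordinary causal softmax attention, and with a single chunk covering the whole sequence it reduces to a pure gated recurrence.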