LaMoS: Enabling Efficient Large Number Modular Multiplication through SRAM-based CiM Acceleration

Barrett's algorithm is one of the most widely used methods for modular multiplication, a critical nonlinear operation in modern privacy computing techniques such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). Because modular multiplication dominates the processing time in these applications, its computational complexity and memory limitations significantly impact performance. Computing-in-Memory (CiM) is a promising approach to this problem. However, existing schemes suffer from two main problems: 1) most works focus on low bit-width modular multiplication, which is inadequate for mainstream cryptographic algorithms such as elliptic curve cryptography (ECC) and RSA, both of which require high bit-width operations; 2) recent efforts targeting large-number modular multiplication rely on inefficient in-memory logic operations, resulting in increased latency and high scaling costs at larger bit-widths. To address these issues, we propose LaMoS, an efficient SRAM-based CiM design for large-number modular multiplication that offers high scalability and area efficiency. First, we analyze Barrett's modular multiplication method and map its workload onto SRAM CiM macros for high bit-width cases. Second, we develop an efficient CiM architecture and dataflow optimized for large-number modular multiplication. Finally, we refine the mapping scheme with workload grouping to improve scalability in high bit-width scenarios. Experimental results show that LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing SRAM-based CiM designs.
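For readers unfamiliar with Barrett's method, the following is a minimal software sketch of Barrett modular multiplication on large integers. It is not the LaMoS hardware mapping: the function names, the 64-bit limb width, and the schoolbook limb loop are illustrative assumptions, chosen only to show how a high bit-width product decomposes into many small partial products of the kind a multiply-accumulate array could compute.

```python
def barrett_precompute(n):
    """Return (k, mu) for modulus n >= 2, where mu = floor(4^k / n)."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n
    return k, mu


def to_limbs(x, limb_bits):
    """Split a non-negative integer into little-endian limbs."""
    mask = (1 << limb_bits) - 1
    limbs = []
    while x:
        limbs.append(x & mask)
        x >>= limb_bits
    return limbs or [0]


def limb_multiply(a, b, limb_bits=64):
    """Schoolbook product built from limb-by-limb partial products.

    The limb width is a hypothetical choice for illustration only.
    """
    total = 0
    for i, ai in enumerate(to_limbs(a, limb_bits)):
        for j, bj in enumerate(to_limbs(b, limb_bits)):
            # Each partial product is shifted to its weight and accumulated.
            total += (ai * bj) << (limb_bits * (i + j))
    return total


def barrett_mulmod(a, b, n, k, mu):
    """Compute (a * b) mod n for 0 <= a, b < n without dividing by n."""
    x = limb_multiply(a, b)
    # Estimate the quotient with a multiplication and a shift.
    q = limb_multiply(x, mu) >> (2 * k)
    r = x - q * n
    # The estimate never exceeds the true quotient, so only a small
    # number of correcting subtractions is needed.
    while r >= n:
        r -= n
    return r


# Usage with a 255-bit modulus (the Curve25519 prime, used here as an example).
n = (1 << 255) - 19
k, mu = barrett_precompute(n)
a = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF % n
b = 0x0FEDCBA987654321FEDCBA987654321FEDCBA987654321FEDCBA9876543210 % n
assert barrett_mulmod(a, b, n, k, mu) == (a * b) % n
```

The division by the modulus is replaced by two large multiplications and a shift, which is why the cost of wide multiplication dominates as the bit-width grows.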
