Barrett's algorithm is one of the most widely used methods for modular multiplication, a critical nonlinear operation in modern privacy computing techniques such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). Because modular multiplication dominates the processing time in these applications, its computational complexity and memory limitations significantly impact performance. Computing-in-Memory (CiM) is a promising approach to this problem. However, existing schemes suffer from two main problems: (1) most works focus on low bit-width modular multiplication, which is inadequate for mainstream cryptographic algorithms such as elliptic curve cryptography (ECC) and RSA, both of which require high bit-width operations; (2) recent efforts targeting large-number modular multiplication rely on inefficient in-memory logic operations, resulting in high scaling costs and increased latency at larger bit-widths. To address these issues, we propose LaMoS, an efficient SRAM-based CiM design for large-number modular multiplication that offers high scalability and area efficiency. First, we analyze Barrett's modular multiplication method and map its workload onto SRAM CiM macros for high bit-width cases. Second, we develop an efficient CiM architecture and dataflow to optimize large-number modular multiplication. Finally, we refine the mapping scheme with workload grouping for better scalability in high bit-width scenarios. Experimental results show that LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing SRAM-based CiM designs.
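For readers unfamiliar with the method, the following is a minimal software sketch of textbook Barrett modular multiplication, using Python's arbitrary-precision integers. It illustrates only the arithmetic that is being accelerated and is not the LaMoS mapping or dataflow. The function names `barrett_setup` and `barrett_mulmod` are hypothetical, and the 256-bit modulus is simply a sample value.

```python
def barrett_setup(n):
    """Precompute Barrett parameters for a fixed modulus n (assumed n >= 2).

    Returns (k, mu) with k = bit length of n and mu = floor(2^(2k) / n).
    """
    if n < 2:
        raise ValueError("modulus must be at least 2")
    k = n.bit_length()
    mu = (1 << (2 * k)) // n  # the only true division, done once per modulus
    return k, mu


def barrett_mulmod(a, b, n, k, mu):
    """Return (a * b) mod n for 0 <= a, b < n using Barrett reduction."""
    x = a * b                    # full-width product, x < n^2 < 2^(2k)
    q = (x * mu) >> (2 * k)      # quotient estimate; never exceeds floor(x / n)
    r = x - q * n                # candidate remainder, may still be >= n
    while r >= n:                # small correction step
        r -= n
    return r


# Sample 256-bit modulus (the secp256k1 field prime), chosen only for illustration.
p = 2**256 - 2**32 - 977
k, mu = barrett_setup(p)
a, b = p - 12345, p - 67890
assert barrett_mulmod(a, b, p, k, mu) == (a * b) % p
```

The reduction replaces a per-operation division with two wide multiplications, a shift, and a subtraction. This is why the cost of high bit-width multiplication matters so much for designs of this kind.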