We introduce a high-throughput neural network accelerator that embeds most network layers directly in hardware, minimizing data transfer and memory usage while preserving a degree of flexibility via a small neural processing unit for the final classification layer. By leveraging power-of-two (Po2) quantization for weights, we replace multiplications with simple rewiring, effectively reducing each convolution to a series of additions. This streamlined approach offers high-throughput, energy-efficient processing, making it highly suitable for applications where model parameters remain stable, such as continuous sensing tasks at the edge or large-scale data center deployments. Furthermore, by including a strategically chosen reprogrammable final layer, our design achieves high throughput without sacrificing fine-tuning capabilities. We implement this accelerator in a 7 nm ASIC flow using MobileNetV2 as a baseline and report throughput, area, accuracy, and sensitivity to quantization and pruning, demonstrating both the advantages and potential trade-offs of the proposed architecture. We find that for MobileNetV2, we can improve inference throughput by 20x over fully programmable GPUs, processing 1.21 million images per second through a full forward pass while retaining fine-tuning flexibility. If absolutely no post-deployment fine-tuning is required, this advantage increases to 67x at 4 million images per second.
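The key arithmetic simplification above can be illustrated in software. The sketch below (a hypothetical illustration, not the paper's implementation; the function `quantize_po2` and its exponent range are assumptions) snaps weights to signed powers of two, so that each multiply in a dot product reduces to an exponent shift of the activation, which in fixed-point hardware amounts to rewiring rather than a multiplier:

```python
import numpy as np

def quantize_po2(w, min_exp=-8, max_exp=0):
    """Snap each weight to the nearest signed power of two (or zero).

    min_exp/max_exp bound the representable exponents; these limits are
    illustrative, not taken from the paper.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    # Round log2 of the magnitude to the nearest integer exponent.
    exp = np.clip(np.round(np.log2(np.where(mag > 0, mag, 2.0**min_exp))),
                  min_exp, max_exp)
    q = sign * 2.0**exp
    q[mag == 0] = 0.0
    return q, exp.astype(int)

w = np.array([0.30, -0.12, 0.06])
qw, exps = quantize_po2(w)

x = np.array([4.0, 8.0, 16.0])
# With Po2 weights, the dot product is a sum of sign-adjusted,
# exponent-shifted activations: no multiplications remain.
acc = np.sum(np.sign(qw) * np.ldexp(x, exps))
assert np.isclose(acc, np.dot(x, qw))
```

Here `np.ldexp(x, e)` computes `x * 2**e`, standing in for the hardwired shift; the final accumulation is the "series of additions" the abstract refers to.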