In this paper, we implement a stand-alone, multi-threaded facial expression recognition system on an SoC FPGA using a Deep Learning Processor Unit (DPU). The system consists of two steps: face detection and facial expression recognition. In the previous work, the face detection step ran a Haar Cascade detector on the CPU because of FPGA resource limitations, but this detector is less accurate on profile faces and on images captured under variable illumination. Moreover, the previous work used a dedicated circuit accelerator, so running a second DNN inference for face detection on the FPGA would have required adding a new accelerator. As an alternative, we run both DNN inferences on a single DPU, a general-purpose CNN accelerator of the systolic array type. Running face detection with DenseBox and facial expression recognition with a CNN on the same DPU uses FPGA resources efficiently while keeping the circuit small. We also developed a multi-threading technique that improves overall throughput while increasing DPU utilization. With this approach, we achieved an overall system throughput of 25 FPS and a throughput per unit of power consumption 2.4 times that of the previous work.
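The two-step, multi-threaded structure described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not use any DPU runtime API: `detect_faces` and `classify_expression` are hypothetical stand-ins for the DenseBox and expression-CNN inferences, and the frame format, queue sizes, and thread counts are assumptions chosen only for illustration. The sketch shows how two queues let the detection stage start on the next frame while the recognition stage is still classifying faces from the previous one.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker thread to shut down


def detect_faces(frame):
    """Hypothetical stand-in for the face detection (DenseBox) inference."""
    # Assumption: each toy frame already carries its list of face crops.
    return list(frame["faces"])


def classify_expression(face):
    """Hypothetical stand-in for the expression recognition CNN inference."""
    # Toy rule so the output is deterministic; a real model would run here.
    return "happy" if face % 2 == 0 else "neutral"


def detection_worker(frames_q, faces_q):
    # Stage 1: pull frames, detect faces, hand each face to stage 2.
    while True:
        frame = frames_q.get()
        if frame is STOP:
            break
        for face in detect_faces(frame):
            faces_q.put((frame["id"], face))


def expression_worker(faces_q, results, lock):
    # Stage 2: classify faces while stage 1 may already be on the next frame.
    while True:
        item = faces_q.get()
        if item is STOP:
            break
        frame_id, face = item
        label = classify_expression(face)
        with lock:
            results.append((frame_id, face, label))


def run_pipeline(frames, n_detect=1, n_express=2):
    frames_q = queue.Queue(maxsize=4)   # bounded queues apply back-pressure
    faces_q = queue.Queue(maxsize=16)
    results, lock = [], threading.Lock()

    detectors = [
        threading.Thread(target=detection_worker, args=(frames_q, faces_q))
        for _ in range(n_detect)
    ]
    classifiers = [
        threading.Thread(target=expression_worker, args=(faces_q, results, lock))
        for _ in range(n_express)
    ]
    for t in detectors + classifiers:
        t.start()

    for frame in frames:
        frames_q.put(frame)
    # Shut down stage 1 first so every detected face is queued before stage 2 stops.
    for _ in detectors:
        frames_q.put(STOP)
    for t in detectors:
        t.join()
    for _ in classifiers:
        faces_q.put(STOP)
    for t in classifiers:
        t.join()
    return sorted(results)  # sort because worker completion order is not fixed


if __name__ == "__main__":
    toy_frames = [{"id": 0, "faces": [2, 3]}, {"id": 1, "faces": []}]
    print(run_pipeline(toy_frames))
```

In a real deployment the two stub functions would submit jobs to the shared accelerator, and the number of threads per stage would be tuned so that the accelerator is rarely idle; the sketch only captures the queue-based overlap between the two stages.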