Recently, post-training quantization (PTQ) has attracted much attention for producing efficient neural networks without lengthy retraining. Despite its low cost, current PTQ methods tend to fail under extremely low-bit settings. In this study, we are the first to confirm that properly incorporating activation quantization into the PTQ reconstruction benefits the final accuracy. To understand the underlying reason, a theoretical framework is established, indicating that the flatness of the optimized low-bit model on calibration and test data is crucial. Based on this conclusion, a simple yet effective approach dubbed QDROP is proposed, which randomly drops the quantization of activations during PTQ. Extensive experiments on various tasks, including computer vision (image classification, object detection) and natural language processing (text classification and question answering), prove its superiority. With QDROP, the limit of PTQ is pushed to 2-bit activations for the first time, and the accuracy boost can be up to 51.49%. Without bells and whistles, QDROP establishes a new state of the art for PTQ. Our code is available at https://github.com/wimh966/QDrop and has been integrated into MQBench (https://github.com/ModelTC/MQBench).
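As a rough illustration of the core idea of randomly dropping activation quantization during PTQ reconstruction, the minimal PyTorch sketch below mixes full-precision and fake-quantized activations element-wise; the `quantizer` callable and the `drop_prob` value are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def qdrop_activation(x: torch.Tensor, quantizer, drop_prob: float = 0.5) -> torch.Tensor:
    """Minimal sketch of randomly dropping activation quantization (QDROP-style).

    With probability `drop_prob`, each activation element keeps its
    full-precision value; otherwise the fake-quantized value is used.
    `quantizer` is assumed to be any quantize-dequantize callable
    (hypothetical here, e.g. a uniform fake-quantization op).
    """
    x_q = quantizer(x)                          # fake-quantized activations
    keep_fp = torch.rand_like(x) < drop_prob    # element-wise drop mask
    return torch.where(keep_fp, x, x_q)         # mix FP and quantized values
```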