Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
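To make the abstract's terminology concrete, the following is a minimal NumPy sketch of fine-grained (block-wise) integer quantization in the spirit of MXINT8: each 32-element block shares a single power-of-two scale, values are clipped to a symmetric range (±127 rather than the asymmetric two's-complement range), and an optional Hadamard rotation spreads activation outliers across a block before quantization. The block size, the symmetric clip, and the `hadamard_rotate` helper are illustrative assumptions that echo the techniques named above; this is not the paper's exact implementation.

```python
import numpy as np

BLOCK_SIZE = 32   # MX-style block size (assumption mirroring "MX with block size 32")
QMAX = 127        # symmetric clip: drop -128 so the integer grid is zero-centered

def quantize_block_int8(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block to INT8 with a shared power-of-two scale (MXINT8-like sketch)."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block, dtype=np.int8), 1.0
    # Choose the smallest power-of-two scale that maps the block into [-QMAX, QMAX].
    scale = 2.0 ** np.ceil(np.log2(amax / QMAX))
    q = np.clip(np.round(block / scale), -QMAX, QMAX).astype(np.int8)
    return q, float(scale)

def fake_quantize_blockwise(x: np.ndarray) -> np.ndarray:
    """Quantize a 1-D tensor block by block and return the dequantized result."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, x.size, BLOCK_SIZE):
        block = x[start:start + BLOCK_SIZE]
        q, scale = quantize_block_int8(block)
        out[start:start + BLOCK_SIZE] = q.astype(np.float64) * scale
    return out

def hadamard_matrix(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_rotate(x: np.ndarray) -> np.ndarray:
    """Spread outliers across each block with a normalized Hadamard transform
    (illustrating the outlier-mitigation idea mentioned for NVINT4).
    Assumes x.size is a multiple of BLOCK_SIZE."""
    H = hadamard_matrix(BLOCK_SIZE) / np.sqrt(BLOCK_SIZE)
    return (x.reshape(-1, BLOCK_SIZE) @ H).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)
    x[::97] *= 20.0  # inject activation-style outliers
    err_plain = np.mean((x - fake_quantize_blockwise(x)) ** 2)
    x_rot = hadamard_rotate(x)
    err_rot = np.mean((x_rot - fake_quantize_blockwise(x_rot)) ** 2)
    print(f"block-wise INT8 MSE: {err_plain:.3e}  (after Hadamard rotation: {err_rot:.3e})")
```

Because the Hadamard transform is orthogonal, the rotation preserves inner products while flattening isolated outliers across the block, which lowers the shared scale each block needs and reduces quantization error for the remaining values; the symmetric ±127 clip is one simple way to keep the rounding grid unbiased around zero.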