Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
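To make the abstract's terminology concrete, the following is a minimal NumPy sketch of fine-grained (block-wise) integer quantization in the spirit of MXINT8: each 32-element block shares a single power-of-two scale, values are clipped to a symmetric range (±127 rather than the asymmetric two's-complement range), and an optional Hadamard rotation spreads activation outliers across a block before quantization. The block size, the symmetric clip, and the `hadamard_rotate` helper are illustrative assumptions that echo the techniques named above; this is not the paper's exact implementation.

```python
import numpy as np

BLOCK_SIZE = 32   # MX-style block size (assumption mirroring "MX with block size 32")
QMAX = 127        # symmetric clip: drop -128 so the integer grid is zero-centered

def quantize_block_int8(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block to INT8 with a shared power-of-two scale (MXINT8-like sketch)."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block, dtype=np.int8), 1.0
    # Choose the smallest power-of-two scale that maps the block into [-QMAX, QMAX].
    scale = 2.0 ** np.ceil(np.log2(amax / QMAX))
    q = np.clip(np.round(block / scale), -QMAX, QMAX).astype(np.int8)
    return q, float(scale)

def fake_quantize_blockwise(x: np.ndarray) -> np.ndarray:
    """Quantize a 1-D tensor block by block and return the dequantized result."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, x.size, BLOCK_SIZE):
        block = x[start:start + BLOCK_SIZE]
        q, scale = quantize_block_int8(block)
        out[start:start + BLOCK_SIZE] = q.astype(np.float64) * scale
    return out

def hadamard_matrix(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_rotate(x: np.ndarray) -> np.ndarray:
    """Spread outliers across each block with a normalized Hadamard transform
    (illustrating the outlier-mitigation idea mentioned for NVINT4).
    Assumes x.size is a multiple of BLOCK_SIZE."""
    H = hadamard_matrix(BLOCK_SIZE) / np.sqrt(BLOCK_SIZE)
    return (x.reshape(-1, BLOCK_SIZE) @ H).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)
    x[::97] *= 20.0  # inject activation-style outliers
    err_plain = np.mean((x - fake_quantize_blockwise(x)) ** 2)
    x_rot = hadamard_rotate(x)
    err_rot = np.mean((x_rot - fake_quantize_blockwise(x_rot)) ** 2)
    print(f"block-wise INT8 MSE: {err_plain:.3e}  (after Hadamard rotation: {err_rot:.3e})")
```

Because the Hadamard transform is orthogonal, the rotation preserves inner products while flattening isolated outliers across the block, which lowers the shared scale each block needs and reduces quantization error for the remaining values; the symmetric ±127 clip is one simple way to keep the rounding grid unbiased around zero.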