Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity ($\mathcal{O}(n^3)$ for an $n\times n$ matrix). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation reaches up to 378 TFLOPS at matrix sizes up to $N=20480$, providing 75\% memory savings and a $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting the optimal decomposition method (SVD or randomized SVD) and precision level based on matrix characteristics and available accelerators. Comprehensive benchmarking demonstrates that Low-Rank GEMM becomes the fastest approach for matrices with $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
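To make the factor-then-multiply idea concrete, the following is a minimal FP32 sketch of a low-rank GEMM, not the FP8, kernel-selecting implementation described above. The helper `lowrank_gemm` is a hypothetical name introduced here for illustration; it uses PyTorch's `torch.svd_lowrank` (a randomized truncated SVD) and assumes the left operand is well approximated by a rank-$r$ factorization.

```python
import torch

def lowrank_gemm(A: torch.Tensor, B: torch.Tensor, rank: int) -> torch.Tensor:
    """Approximate A @ B by first compressing A to a rank-`rank` factorization.

    A randomized truncated SVD gives A ~= U @ diag(S) @ V^T with thin factors,
    so the product can be evaluated as U @ ((diag(S) @ V^T) @ B) at roughly
    O(n^2 * rank) cost instead of the O(n^3) cost of dense GEMM.
    """
    # Randomized truncated SVD: U is (n, rank), S is (rank,), V is (n, rank).
    U, S, V = torch.svd_lowrank(A, q=rank)
    # Associate the multiplications so only thin (rank x n) intermediates are formed.
    return U @ ((S.unsqueeze(1) * V.T) @ B)

# Example usage: a synthetic low-rank operand, approximated at rank 256.
if __name__ == "__main__":
    n, r = 4096, 256
    A = torch.randn(n, r) @ torch.randn(r, n)   # exactly rank-r by construction
    B = torch.randn(n, n)
    C_approx = lowrank_gemm(A, B, rank=r)
    C_exact = A @ B
    rel_err = (C_approx - C_exact).norm() / C_exact.norm()
    print(f"relative error: {rel_err:.2e}")
```

The full system additionally chooses between exact and randomized SVD, picks the rank and precision (e.g., FP8) per matrix, and dispatches to hardware-specific kernels; those components are outside the scope of this sketch.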