RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression

RFX (Random Forests X, where X stands for compression or quantization) is a production-ready Python implementation of Breiman and Cutler's Random Forest classification methodology. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz), all with CPU and GPU acceleration. Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0. This work introduces four solutions to the proximity matrix memory bottleneck, which has limited Random Forest analysis to ~60,000 samples: (1) QLORA (Quantized Low-Rank Adaptation) compression for GPU proximity matrices, reducing memory from 80 GB to 6.4 MB for 100k samples (12,500x compression with INT8 quantization) while preserving 99% of the geometric structure; (2) CPU TriBlock proximity, which combines upper-triangle storage with block-sparse thresholding to achieve a 2.7x memory reduction with lossless quality; (3) SM-aware GPU batch sizing, achieving 95% GPU utilization; and (4) GPU-accelerated 3D MDS visualization, which computes embeddings directly from the low-rank factors using power iteration. Validation across four implementation modes (GPU/CPU x case-wise/non-case-wise) demonstrates correct implementation. The GPU achieves a 1.4x speedup over the CPU for overall importance with 500+ trees. Proximity computation scales from 1,000 to 200,000+ samples (with the upper end requiring GPU QLORA), and CPU TriBlock fills the gap for medium-scale datasets (10K-50K samples). RFX v1.0 eliminates the proximity memory bottleneck, enabling proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible. RFX is open source and provides production-ready classification that follows Breiman and Cutler's original methodology.
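The memory figures quoted in the abstract can be checked with simple arithmetic. The sketch below is not RFX code. It assumes a dense proximity matrix stored as 8-byte floats and a low-rank approximation stored as two INT8 factors of rank 32. The abstract does not state the rank or the number of factors, so those values are illustrative assumptions chosen because they reproduce the 6.4 MB figure, and the function names are hypothetical. Quantization scale factors and other metadata are ignored.

```python
# Back-of-the-envelope memory arithmetic for an n x n proximity matrix.
# All function names and the rank/factor choices are illustrative assumptions.

def dense_bytes(n, bytes_per_entry=8):
    """Bytes for a full n x n matrix (8 bytes per entry assumes float64)."""
    return n * n * bytes_per_entry

def low_rank_int8_bytes(n, rank, n_factors=2):
    """Bytes for n_factors factors of shape (n, rank), one byte per INT8 entry.

    Ignores per-factor quantization scales and any other metadata.
    """
    return n * rank * n_factors

def upper_triangle_bytes(n, bytes_per_entry=8):
    """Bytes for the upper triangle (diagonal included) of a symmetric matrix."""
    return n * (n + 1) // 2 * bytes_per_entry

n = 100_000
dense = dense_bytes(n)                        # 80 GB at 8 bytes per entry
compressed = low_rank_int8_bytes(n, rank=32)  # assumed rank 32, two factors
ratio = dense // compressed

print(f"dense:       {dense / 1e9:.1f} GB")
print(f"low-rank:    {compressed / 1e6:.1f} MB")
print(f"compression: {ratio:,}x")
print(f"upper-triangle saving: {dense / upper_triangle_bytes(n):.2f}x")
```

Under these assumptions the dense matrix takes 80 GB and the factors take 6.4 MB, which matches the 12,500x ratio in the abstract. Upper-triangle storage alone saves roughly 2x, so the remainder of the 2.7x TriBlock reduction presumably comes from block-sparse thresholding.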
