利用GPU上的批量BLAS/LAPACK核函数并行求解块三对角系统 (Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems)

We present a GPU implementation for the factorization and solution of block-tridiagonal symmetric positive definite linear systems, which commonly arise in time-dependent estimation and optimal control problems. Our method employs a recursive algorithm based on Schur complement reduction, transforming the system into a hierarchy of smaller, independent blocks that can be efficiently solved in parallel using batched BLAS/LAPACK routines. While batched routines have been used in sparse solvers, our approach applies these kernels in a tailored way by exploiting the block-tridiagonal structure known in advance. Performance benchmarks based on our open-source, cross-platform implementation, TBD-GPU, demonstrate the advantages of this tailored utilization: achieving substantial speed-ups compared to state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. However, the current implementation still performs sequential calls of batched routines at each recursion level, and the block size must be sufficiently large to adequately amortize kernel launch overhead.

翻译：本文提出了一种GPU实现方法，用于分解和求解块三对角对称正定线性系统，这类系统常见于时变估计和最优控制问题中。我们的方法采用基于舒尔补约简的递归算法，将系统转化为层次化的、更小的独立块，这些块可通过批量BLAS/LAPACK例程高效并行求解。尽管批量例程已在稀疏求解器中得到应用，但我们的方法通过预先已知的块三对角结构，以定制化的方式应用这些核函数。基于我们开源、跨平台的实现TBD-GPU的性能基准测试表明，这种定制化利用具有显著优势：与包括CHOLMOD和HSL MA57在内的先进CPU直接求解器相比，实现了大幅加速，同时与NVIDIA cuDSS保持竞争力。然而，当前实现仍在每个递归层级顺序调用批量例程，且块大小必须足够大以充分分摊核函数启动开销。