Snarls and superbubbles are fundamental pangenome decompositions capturing variant sites. These bubble-like structures underpin key tasks in computational pangenomics, including structural-variant genotyping, distance indexing, haplotype sampling, and variant annotation. Snarls can be quadratically-many in the size of the graph, and since their introduction in 2018 with the vg toolkit, there has been no work on identifying all snarls in linear time. Moreover, while it is known how to find superbubbles in linear time, this result is a highly specialized solution only achieved after a long series of papers. We present the first algorithm identifying all snarls in linear time. This is based on a new representation of all snarls, of size linear in the input graph size, and which can be computed in linear time. Our algorithm is based on a unified framework that also provides a new linear-time algorithm for finding superbubbles. An observation behind our results is that all such structures are separated from the rest of the graph by two vertices (except for cases which are trivially computable), i.e. their endpoints are a 2-separator of the underlying undirected graph. Based on this, we employ the well-known SPQR tree decomposition, which encodes all 2-separators, to guide a traversal that finds the bubble-like structures efficiently. We implemented our algorithms in C++ (available at https://github.com/algbio/BubbleFinder) and evaluated them on various pangenomic datasets. Our algorithms outcompete or they are on the same level of existing methods. For snarls, we are up to two times faster than vg, while identifying all snarls. When computing superbubbles, we are up to 50 times faster than BubbleGun. Our SPQR tree framework provides a unifying perspective on bubble-like structures in pangenomics, together with a template for finding other bubble-like structures efficiently.
翻译:Snarl与superbubble是捕获变异位点的基本泛基因组分解结构。这类气泡状结构支撑着计算泛基因组学的关键任务,包括结构变异基因分型、距离索引、单倍型采样和变异注释。Snarl的数量可能达到图规模的二次方,自2018年随vg工具包提出以来,尚未有工作实现线性时间识别所有snarl。此外,虽然已知如何在线性时间内寻找superbubble,但该结果是经过长期系列研究才实现的专门化解决方案。本文提出首个线性时间识别所有snarl的算法,其基于新型的snarl全结构表示法——该表示规模与输入图大小呈线性关系,且可在线性时间内计算完成。本算法基于统一框架,同时为superbubble的发现提供了新的线性时间算法。我们成果的核心观察是:所有此类结构(除可平凡计算的情况外)均通过两个顶点与图其余部分分离,即其端点构成底层无向图的2-分隔子。基于此,我们采用经典的SPQR树分解(可编码所有2-分隔子)来指导遍历过程,从而高效发现气泡状结构。我们在C++中实现了算法(代码见https://github.com/algbio/BubbleFinder),并在多种泛基因组数据集上进行了评估。本算法性能优于或持平现有方法:对于snarl识别,在发现全部snarl的同时速度较vg提升最高达两倍;对于superbubble计算,速度较BubbleGun提升最高达50倍。本研究的SPQR树框架为泛基因组学中的气泡状结构提供了统一视角,并为高效发现其他气泡状结构提供了模板。