基于图的近似最近邻搜索再审视：理论分析与优化 (Graph-Based Approximate Nearest Neighbor Search Revisited: Theoretical Analysis and Optimization)

Graph-based approaches to approximate nearest neighbor search (ANNS) have achieved remarkable success in enabling fast, high-recall retrieval on billion-scale vector datasets. Among them, the Sparse Neighborhood Graph (SNG) has emerged as a widely adopted graph structure due to its superior search performance. However, the theoretical understanding of SNG remains limited, leading to reliance on heuristic-based and often suboptimal truncation strategies. In this work, we aim to bridge the gap between theory and practice by providing formal guarantees for graph-based ANNS methods and proposing principled optimization strategies for the truncation parameter. By characterizing the index construction process through martingale-based analysis, we show that the degree of the index graph is $O(n^{2/3+\epsilon})$, where $\epsilon$ is an arbitrarily small constant. Furthermore, we prove that the expected search path length during query processing is $O(\log n)$. Based on these theoretical insights, we introduce a novel and principled method for selecting the truncation parameter $R$ in SNG. Experimental results demonstrate that our method achieves comparable or superior performance in terms of query latency and Recall@10 compared to commonly used binary search heuristics, while yielding 2x to 9x speedups in overall index construction.

翻译：基于图的近似最近邻搜索方法在实现十亿级向量数据集上的快速、高召回率检索方面取得了显著成功。其中，稀疏邻域图因其卓越的搜索性能已成为广泛采用的图结构。然而，目前对SNG的理论理解仍较为有限，导致实践中依赖启发式且往往次优的截断策略。本研究旨在通过为基于图的ANNS方法提供形式化保证，并提出截断参数的原理性优化策略，以弥合理论与实践的鸿沟。通过基于鞅分析的索引构建过程刻画，我们证明索引图的度数为$O(n^{2/3+\epsilon})$，其中$\epsilon$为任意小常数。进一步，我们证实在查询处理过程中期望搜索路径长度为$O(\log n)$。基于这些理论洞见，我们提出了一种新颖且原理性的SNG截断参数$R$选择方法。实验结果表明，与常用的二分搜索启发式方法相比，我们的方法在查询延迟和Recall@10指标上达到相当或更优的性能，同时整体索引构建速度提升2至9倍。