Filtered Approximate Nearest Neighbor (ANN) search retrieves the vectors closest to a query vector from a dataset while enforcing that a specified set of discrete query labels $S$ is contained in the label set of each retrieved vector. Existing graph-based methods typically incorporate filter awareness by assigning fixed penalties or by prioritizing nodes based on filter satisfaction. However, because these methods use fixed, data-independent penalties, they often fail to generalize across datasets with diverse label and vector distributions. In this work, we propose a principled alternative that learns the optimal trade-off between vector distance and filter match directly from the data, rather than relying on fixed penalties. We formulate this as a constrained linear optimization problem, deriving weights that better reflect the underlying filter distribution and more effectively address the filtered ANN search problem. These learned weights guide both the search process and index construction, yielding graph structures that better capture the underlying filter distribution and filter semantics. Our experiments demonstrate that adapting the distance function to the data improves accuracy by 5-10% over fixed-penalty methods, providing a more flexible and generalizable framework for the filtered ANN search problem.
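To make the core idea concrete, the following is a minimal sketch of a combined scoring function that trades off vector distance against filter mismatch via weights $\alpha$ and $\beta$. The function name, the Euclidean distance choice, the toy data, and the specific weight values are all illustrative assumptions; in the paper these weights would be learned from the data via the constrained linear optimization, not hand-set.

```python
import numpy as np

def combined_score(query_vec, cand_vec, query_labels, cand_labels, alpha, beta):
    """Hypothetical combined score: vector distance plus a weighted
    penalty for each query label missing from the candidate's labels."""
    dist = np.linalg.norm(query_vec - cand_vec)          # vector distance term
    missing = len(query_labels - cand_labels)            # unmet filter labels
    return alpha * dist + beta * missing

# Toy dataset: 100 random 8-d vectors, each tagged with 2 of 5 labels.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))
labels = [set(rng.choice(5, size=2, replace=False)) for _ in range(100)]

# Query vector and its required label set S.
q = rng.normal(size=8)
q_labels = {0, 1}

# Illustrative weights; the paper learns these from the data distribution.
alpha, beta = 1.0, 2.5
scores = [combined_score(q, data[i], q_labels, labels[i], alpha, beta)
          for i in range(len(data))]
top = np.argsort(scores)[:10]  # indices of the 10 best-scoring candidates
```

A larger $\beta$ pushes the search toward candidates that satisfy the filter even at the cost of some extra vector distance, which is exactly the trade-off the learned weights calibrate per dataset.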