Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by processing representative subsets rather than massive datasets. While sampling from explicit lists is well-understood, modern applications -- such as machine learning over relational data -- often require sampling from a set defined implicitly by a relational join. In this paper, we study the problem of \emph{subset sampling over joins}: drawing a random subset from the join results, where each join result is included independently with some probability. We address the general setting where the probability is derived from input tuple weights via decomposable functions (e.g., product, sum, min, max). Since the join size can be exponentially larger than the input, the naive approach of materializing all join results to perform subset sampling is computationally infeasible. We propose the first efficient algorithms for subset sampling over acyclic joins: (1) a \emph{static index} for generating multiple (independent) subset samples over joins; (2) a \emph{one-shot} algorithm for generating a single subset sample over joins; (3) a \emph{dynamic index} that can support tuple insertions, while maintaining a one-shot sample or generating multiple (independent) samples. Our techniques achieve near-optimal time and space complexity with respect to the input size and the expected sample size.
翻译:子集采样(亦称泊松采样),即每个特定元素被纳入样本的决策独立于所有其他元素,是数据分析中的一项基本原语,通过处理代表性子集而非海量数据集实现高效近似。尽管从显式列表中进行采样已有深入研究,但现代应用——例如关系数据上的机器学习——通常需要从由关系连接隐式定义的集合中进行采样。本文研究\emph{连接操作上的子集采样}问题:从连接结果中抽取随机子集,其中每个连接结果以特定概率独立地被包含。我们处理一般性场景,其中概率通过可分解函数(如乘积、求和、最小值、最大值)从输入元组权重导出。由于连接结果规模可能相对于输入呈指数级增长,物化所有连接结果以执行子集采样的朴素方法在计算上不可行。我们针对无环连接提出了首个高效算法:(1)用于在连接上生成多个(独立)子集样本的\emph{静态索引};(2)用于在连接上生成单次子集样本的\emph{单次生成}算法;(3)可支持元组插入、同时维护单次样本或生成多个(独立)样本的\emph{动态索引}。我们的技术在输入规模和期望样本大小方面实现了近乎最优的时间和空间复杂度。