Sampling from multiple distributions so as to maximize overlap has been studied by statisticians since the 1950s. Since the 2000s, such correlated sampling from the probability simplex has been a powerful building block in disparate areas of theoretical computer science. We study a generalization of this problem to sampling sets from given vectors in the hypersimplex, i.e., outputting sets of size (at most) some $k$ in $[n]$, while maximizing the sampled sets' overlap. Specifically, the expected difference between two output sets should be at most $α$ times their input vectors' $\ell_1$ distance. A value of $α=O(\log n)$ is known to be achievable, due to Chen et al.~(ICALP'17). We improve this factor to $O(\log k)$, independent of the ambient dimension~$n$. Our algorithm satisfies other desirable properties, including (up to a $\log^* n$ factor) input-sparsity sampling time, logarithmic parallel depth and dynamic update time, as well as preservation of submodular objectives. Anticipating broader use of correlated sampling algorithms for the hypersimplex, we present applications of our algorithm to online paging, offline approximation of metric multi-labeling and swift multi-scenario submodular welfare approximating reallocation.
翻译:自20世纪50年代以来,统计学家已开始研究从多个分布中采样以最大化重叠的问题。自21世纪初以来,从概率单纯形中进行此类相关采样已成为理论计算机科学多个不同领域中的关键构建模块。本文研究该问题在超单纯形上的推广:从给定向量中采样集合,即输出大小为(至多)$k$的$[n]$子集,同时最大化采样集合之间的重叠。具体而言,两个输出集合的期望差异应不超过其输入向量$\ell_1$距离的$α$倍。Chen等人(ICALP'17)已证明$α=O(\log n)$是可实现的。我们将该因子改进为$O(\log k)$,使其独立于环境维度$n$。我们的算法满足其他理想性质,包括(在$\log^* n$因子内)输入稀疏性采样时间、对数级并行深度与动态更新时间,以及保持子模目标函数。为预见超单纯形相关采样算法的更广泛应用,我们展示了算法在在线分页、度量多标签离线近似以及快速多场景子模福利近似重分配中的应用。