Efficient Exploration of Interesting Aggregates in RDF Graphs

From arXiv. Accepted for publication in the Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20-25, 2021, Virtual Event, China.

As large Open Data sets are increasingly shared as RDF graphs, there is a growing demand to help users discover the most interesting facets of a graph, which are often hard to grasp without automatic tools. We consider the problem of automatically identifying the k most interesting aggregate queries that can be evaluated on an RDF graph, given an integer k and a user-specified interestingness function. Our problem departs from analytics in relational data warehouses in two ways: (i) in an RDF graph, the facts, dimensions, and measures of candidate aggregates are not given and must be identified; and (ii) the classical approach to efficiently evaluating multiple aggregates breaks down in the face of multi-valued dimensions in RDF data. In this work, we propose an extensible end-to-end framework that enables the identification and evaluation of interesting aggregates. It is based on a new RDF-compatible one-pass algorithm for efficiently evaluating a lattice of aggregates and a novel early-stop technique (with probabilistic guarantees) that can prune uninteresting aggregates. Experiments using both real and synthetic graphs demonstrate the ability of our framework to find interesting aggregates in a large search space, the efficiency of our algorithms (with up to 2.9x speedup over a similar pipeline based on existing algorithms), and scalability as the data size and complexity grow.
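To make point (ii) concrete, the following minimal sketch shows one way a multi-valued dimension can break the classical shortcut of computing a coarse aggregate by rolling up a finer one. It is an illustration only, not the paper's algorithm: the toy data, the `count_by` and `roll_up` helpers, and the choice of a fact count as the aggregate are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy facts: each paper has one year but possibly several authors,
# so "author" is a multi-valued dimension.
facts = {
    "paper1": {"year": [2020], "author": ["alice", "bob"]},
    "paper2": {"year": [2020], "author": ["alice"]},
    "paper3": {"year": [2021], "author": ["bob", "carol"]},
}


def count_by(facts, dims):
    """Count distinct facts for every combination of values of `dims`."""
    groups = defaultdict(set)
    for fact_id, values in facts.items():
        # Build every combination of dimension values this fact belongs to.
        combos = [()]
        for dim in dims:
            combos = [combo + (v,) for combo in combos for v in values[dim]]
        for combo in combos:
            groups[combo].add(fact_id)
    return {combo: len(ids) for combo, ids in groups.items()}


def roll_up(fine, keep_index):
    """Classical lattice shortcut: sum a finer aggregate over the dropped dimension."""
    coarse = defaultdict(int)
    for key, count in fine.items():
        coarse[(key[keep_index],)] += count
    return dict(coarse)


by_year_author = count_by(facts, ["year", "author"])
direct = count_by(facts, ["year"])    # evaluated directly on the facts
rolled = roll_up(by_year_author, 0)   # derived from the finer aggregate

print("direct:", direct)
print("rolled:", rolled)
```

In this toy data, the direct evaluation counts two papers for 2020 and one for 2021, while the roll-up counts three and two, because each paper with two authors contributes to two (year, author) groups and is then summed twice. With single-valued dimensions the two results would coincide, which is what the relational lattice approach relies on.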
