Efficient Big Text Data Clustering Algorithms using Hadoop and Spark

Document clustering is a long-established and effective text mining technique for gaining insight into which documents of a collection can be grouped together. K-Means and Hierarchical Agglomerative Clustering (HAC) are two of the best-known and most commonly used clustering algorithms, the former for its low time cost and the latter for its accuracy. However, even K-Means can incur unacceptable time costs when clustering text over large-scale collections. In this paper we first review some of the most valuable approaches to document clustering over such 'big data' (large-scale) collections. We then present two promising alternatives: (a) a variation of an existing fast K-Means-based clustering technique (known as BigKClustering, or BKC), adapted so that it can be applied to document clustering, and (b) a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the resulting clusters as the initial centers for a K-Means-based assignment of the remaining documents, requiring very few iterations. We also give efficient adaptations of the proposed techniques to the MapReduce model, which we test experimentally with Apache Hadoop and Spark on a real cluster environment. The experiments show that both techniques achieve acceptable clustering quality together with significant time improvements over K-Means (especially the Buckshot-based algorithm), making them promising alternatives for big document collections.
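
To make the Buckshot idea concrete, the following is a minimal single-machine sketch in plain Python, not the customized MapReduce version evaluated in the paper. It assumes documents are already dense term-weight vectors, uses the classic Buckshot sample size of sqrt(k·n), and substitutes a naive centroid-linkage HAC with cosine similarity; the function names (`buckshot`, `hac_centroids`, `assign`) and the default of two refinement iterations are illustrative assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hac_centroids(sample, k):
    """Naive centroid-linkage HAC: merge the two most similar clusters until k remain."""
    clusters = [[v] for v in sample]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        best = None  # (similarity, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(cents[i], cents[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]


def assign(docs, centers):
    """Label each document with the index of its most similar center."""
    return [max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs]


def buckshot(docs, k, iterations=2, seed=0):
    """Hypothetical helper: HAC on a sample seeds a short K-Means-style refinement."""
    rng = random.Random(seed)
    # Classic Buckshot sample size sqrt(k * n), clamped to [k, n] (an assumption here).
    sample_size = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    sample = rng.sample(docs, sample_size)
    centers = hac_centroids(sample, k)
    labels = assign(docs, centers)
    for _ in range(iterations):  # very few iterations, as in the hybrid approach
        for c in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
        labels = assign(docs, centers)
    return labels, centers
```

Calling `buckshot(vectors, k)` returns one cluster label per document plus the final centers. Because the quadratic HAC step runs only on the sample, its cost stays close to linear in the collection size, which is what makes the assignment phase a natural fit for a map-style pass over the full dataset.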
