Hierarchical clustering groups a data set by decomposing it hierarchically according to some criterion. The method allows the number of clusters to be specified and has been widely applied across research and application domains. The coal industry often wishes to perform cluster analysis on coal products to support development and production. As the volume of coal data collected in coal systems grows, hierarchical clustering requires a large amount of memory because it must compute a large similarity matrix, and existing hierarchical clustering algorithms cannot process massive data sets effectively. To handle the large-scale data generated in the coal domain, this paper proposes a distributed hierarchical clustering algorithm built on a cloud computing platform. The algorithm stores and computes the similarity matrix in a distributed manner and completes hierarchical clustering quickly and accurately. Two sets of experiments show that the algorithm is highly efficient and highly scalable.
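To make the memory bottleneck concrete, the following is a minimal single-machine sketch of agglomerative hierarchical clustering driven by a pairwise distance matrix. It is not the distributed algorithm proposed in the paper. The single-linkage rule, the Euclidean distance, and the names `pairwise_distances`, `agglomerative`, and `matrix_bytes` are assumptions chosen for illustration only.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Return the full n x n Euclidean distance matrix as nested lists."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = math.dist(points[i], points[j])
        dist[i][j] = dist[j][i] = d
    return dist


def agglomerative(points, k):
    """Single-linkage agglomerative clustering (illustrative assumption).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until exactly k clusters remain.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    dist = pairwise_distances(points)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


def matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to hold the upper triangle of an n x n matrix."""
    return n * (n - 1) // 2 * bytes_per_entry


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
    print(agglomerative(samples, 3))
    print(matrix_bytes(1_000_000))
```

Under these assumptions, the upper triangle of the matrix for one million samples already needs about 4 TB at 8 bytes per entry. That quadratic growth is why the similarity matrix has to be split across machines rather than held in one machine's memory.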