Excerpt from the paper
The rapid growth of data gives us more information, but it also challenges traditional techniques for acquiring information. This paper proposes MCMM, a minimum spanning tree (MST) based classification model for large-scale data, implemented with MapReduce. It can be viewed as a model that lies between the traditional KNN method and clustering-based classification, and it aims to overcome the shortcomings of both while handling large-scale data. In this model, the training set is treated as a weighted, undirected complete graph: the vertices are the objects, and the weight of the edge between two vertices is the distance between the corresponding objects. This distance is not the Euclidean distance but a specific distance measure. A set of minimum spanning trees can then be found in the graph, with each tree representing one class. To reduce time complexity, the most representative points of each tree are extracted to stand for that tree. An unlabeled object is then classified by computing its distance to these condensed sets of points. The MCMM model is implemented with MapReduce and deployed on the Hadoop platform. It scales to large data because Hadoop supports data-intensive distributed applications that can work with thousands of nodes and large amounts of data. In addition, MapReduce and Hadoop run well on clusters built from commodity machines. The MCMM model benefits from running on a cloud platform and from doing cloud computing with MapReduce and Hadoop. The data sets used in the experiments include real data from the UCI repository and some synthetic data, and the experiments used 4000 clusters. The experiments show that MCMM outperforms KNN and several other commonly used basic classification methods in accuracy and scalability.
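The pipeline described in the abstract (complete graph, spanning trees per class, representative points, nearest-representative classification) can be illustrated with a small single-machine sketch. Several details here are assumptions, because the abstract does not give them: Euclidean distance stands in for the paper's own distance measure, the forest is obtained by cutting MST edges that join differently labeled objects, and each tree is represented by its medoid. The function names are hypothetical, and the MapReduce/Hadoop distribution is left out entirely.

```python
import math
from collections import defaultdict


def dist(a, b):
    # Placeholder metric (assumption): the abstract says the paper uses its
    # own distance measure instead of Euclidean distance, but does not define it.
    return math.dist(a, b)


def prim_mst(points):
    """Return the MST edges (i, j) of the complete graph over points (Prim's algorithm)."""
    n = len(points)
    # best[k] = (cheapest known distance from the tree to k, tree vertex giving it)
    best = {i: (dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        _, parent = best.pop(j)
        edges.append((parent, j))
        for k in best:
            d = dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


def build_forest(points, labels):
    """Cut MST edges that join different labels (assumption); return one index list per tree."""
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for i, j in prim_mst(points):
        if labels[i] == labels[j]:
            root[find(i)] = find(j)
    trees = defaultdict(list)
    for i in range(len(points)):
        trees[find(i)].append(i)
    return list(trees.values())


def representatives(points, labels, trees):
    """Represent each tree by its medoid (an assumed selection criterion)."""
    reps = []
    for members in trees:
        medoid = min(
            members,
            key=lambda i: sum(dist(points[i], points[j]) for j in members),
        )
        reps.append((points[medoid], labels[medoid]))
    return reps


def classify(x, reps):
    """Assign x the label of its nearest representative point."""
    return min(reps, key=lambda r: dist(x, r[0]))[1]


# Toy training set with two well-separated classes.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
trees = build_forest(train, labels)
reps = representatives(train, labels, trees)
print(classify((1.0, 1.0), reps), classify((5.5, 5.5), reps))
```

Because classification compares an unlabeled object only with the representatives and not with every training object, the per-query cost depends on the number of trees, which is the time saving the abstract refers to.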