MapReduce-Based Distributed Clustering Method Using CF<sup>+</sup> Tree
by: Hyeong-Cheol Ryu, Sungwon Jung
| Format: | Article |
|---|---|
| Published: | IEEE 2020-01-01 |
Description
Clustering exceptionally large data sets is becoming a major challenge in data analytics as their size continues to grow. Summary-based clustering methods, such as BIRCH and its extension CF<sup>+</sup>-ERC, and distributed computing frameworks such as MapReduce can handle this challenge efficiently. CF<sup>+</sup>-ERC reduces the clustering time of large data sets by exploiting the structure of a CF<sup>+</sup> tree. However, CF<sup>+</sup>-ERC is a sequential clustering method, so it cannot use multiple machines to reduce the clustering time further. In this study, we propose a novel MapReduce-based distributed clustering method called CF<sup>+</sup>-ERC on MapReduce (CF<sup>+</sup>ERC_MR). It builds a CF<sup>+</sup> tree for an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. The method is also scalable with respect to the number of machines. Its efficacy is validated through both theoretical analysis and in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that our approach clusters far faster than existing clustering methods.
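
To illustrate why summary-based clustering suits a MapReduce setting, the following minimal sketch shows the classic BIRCH clustering feature (CF), a triple of point count, linear sum, and squared sum, together with its additivity property. It is not the CF<sup>+</sup>ERC_MR algorithm from the article: the names `CF`, `map_partition`, and `reduce_summaries` are hypothetical, the CF<sup>+</sup> tree and its threshold are not modeled, and the map and reduce phases are simulated in a single process rather than run on a real MapReduce framework.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class CF:
    """BIRCH-style clustering feature: (N, linear sum, sum of squared norms)."""
    n: int
    ls: tuple   # per-dimension sum of the summarized points
    ss: float   # sum of squared Euclidean norms of the points

    @staticmethod
    def of_point(p):
        # Summary of a single point.
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # Additivity: the CF of a union is the component-wise sum.
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root mean squared distance of the points from the centroid.
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def map_partition(points):
    # Hypothetical map step: summarize one data partition locally.
    return reduce(CF.merge, (CF.of_point(p) for p in points))


def reduce_summaries(summaries):
    # Hypothetical reduce step: combine the per-partition summaries.
    return reduce(CF.merge, summaries)


# Two partitions, as if held by two different machines.
partitions = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(0.0, 2.0), (2.0, 2.0)],
]
total = reduce_summaries(map_partition(part) for part in partitions)
print(total.n, total.centroid(), total.radius())
```

Because merging is associative and commutative, each machine can summarize its own partition independently and only the compact summaries need to be combined, which is the general property that makes CF-based methods amenable to distribution.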