Data Decomposition Techniques for
Parallel Tree-Based K-Means Clustering
The main computation in k-means clustering is the calculation of distances between cluster centroids and patterns. As the number of patterns and the number of centroids increase, the time needed to complete these computations also increases. This computational load calls for high-performance computers and/or algorithmic improvements. The parallel tree-based k-means algorithm on distributed-memory machines combines algorithmic improvements with the high computational capacity of parallel computers to deal with huge datasets. Its performance is affected by the data decomposition technique. In this thesis, we present a novel data decomposition technique to improve the performance of the parallel tree-based k-means algorithm on distributed-memory machines. The proposed tree-based decomposition techniques try to decrease the total number of distance calculations by assigning compact subspaces to processors. A compact subspace makes the pruning function of the tree-based k-means algorithm more effective. We have implemented the algorithm and conducted experiments on the PC cluster established at the Computer Engineering Department of Bilkent University. Our experimental results demonstrate that the tree-based decomposition technique outperforms the random and stripwise decomposition techniques.
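The effect of compactness on pruning can be illustrated with a minimal sketch. It assumes a simple bounding-box rule that may differ from the pruning function used in the thesis: a centroid is discarded for a subspace when its smallest possible distance to the box exceeds the largest possible distance of some other centroid. The function names, the boxes, and the centroid coordinates are all hypothetical and chosen only for illustration.

```python
import math

def min_dist(box, c):
    """Smallest Euclidean distance from centroid c to any point in the box."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for l, h, x in zip(lo, hi, c)))

def max_dist(box, c):
    """Largest Euclidean distance from centroid c to any point in the box."""
    lo, hi = box
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for l, h, x in zip(lo, hi, c)))

def surviving_centroids(box, centroids):
    """Centroids that cannot be ruled out for the patterns inside the box.

    Assumed rule: drop c if even its closest point in the box is farther
    away than the farthest point of the best competing centroid.
    """
    best_upper = min(max_dist(box, c) for c in centroids)
    return [c for c in centroids if min_dist(box, c) <= best_upper]

# Hypothetical data: four centroids on a line, two subspaces of equal area.
centroids = [(0.5, 0.5), (2.5, 0.5), (4.5, 0.5), (6.5, 0.5)]
compact_box = ((0.0, 0.0), (1.0, 1.0))    # square subspace
strip_box = ((0.0, 0.0), (8.0, 0.125))    # long, thin strip

compact_survivors = surviving_centroids(compact_box, centroids)
strip_survivors = surviving_centroids(strip_box, centroids)
print(len(compact_survivors), len(strip_survivors))
```

In this toy setting the square subspace leaves a single candidate centroid, while the strip of the same area keeps all four, so every pattern in the strip would still need four distance calculations.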
DATE: July 17, 2002, Wednesday @ 09:30