Department of Computer Engineering
Efficient Parallel Frequency Mining based on
a Novel Top-Down Partitioning Scheme for Transactional Data
M.S. in Computer Engineering
Supervisor: Prof. Dr. Cevdet Aykanat
In recent years, large quantities of data have been amassed with advances in data acquisition capabilities. Automated detection of useful information is required for vast data obtained from scientific and business domains. Data Mining is the application of efficient algorithmic solutions on a variety of immense data for such knowledge discovery.
Frequency mining discovers all frequent patterns in a transaction or relational database and it comprises the core of several data mining algorithms such as association rule mining and sequence mining. Frequency pattern discovery has become a challenge for parallel programming since it is a highly complex operation on huge datasets demanding efficient and scalable algorithms.
In this thesis, we propose a new family of parallel frequency mining algorithms. We introduce a novel transaction set partitioning scheme that can be used to divide the frequency mining task in a top-down fashion. The method operates on the graph of frequent patterns with length two (GF2) from which a graph partitioning by vertex separator (GPVS) is mapped to a two-way partitioning on the transaction set. The two parts obtained can be mined independently and therefore can be utilized for concurrency. In order for this property to hold, there is an amount of replication dictated by the separator in GF2, which is minimized by the GPVS algorithm. A k-way partitioning is derived from recursive application of $2$-way partitioning scheme which is used in the design of a generic parallel frequency mining algorithm. First we compute GF2 in parallel, succeeding that we designate a $k$-way partitioning of the database for $k$ processors with a parallel recursive procedure. The database is redistributed such that each processor is assigned one part. Subsequent mining proceeds simultaneously and independently at each processor with a given serial mining algorithm. A complete implementation in which we employ FP-Growth as the sequential algorithm has been achieved. The performance study of the algorithm on a Beowulf system demonstrates favorable performance for synthetic databases. For hard instances of the problem, we have gained approximately twice the speedup of a state-of-the-art algorithm.
We also present a correction and optimization to FP-Growth algorithm.
DATE: January 31, 2002, Thursday @ 14:00