PARALLEL SEQUENCE MINING ON DISTRIBUTED-MEMORY SYSTEMS

Embiya KARAPINAR

M.S. in Computer Engineering

Supervisor: Asst. Prof. Dr. Atilla Gursoy

February 23, 2001 at 09:40 in EA 502

 

ABSTRACT

 

Discovering all the frequent sequences in very large databases is a time consuming task. However, large databases forces to partition the original database into chunks of data to process in main-memory. Most current algorithms require as many database scans as the longest frequent sequences. Spade is a fast algorithm which reduces the number of database scans by using lattice-theoretic approach to decompose original problem into small pieces (equivalence classes) which can be processed in main-memory independently.

In this thesis work, we present ,dSpade, a parallel algorithm based on spade for discovering the set of all frequent sequences, targeting distributed-memory systems. In dSpade, horizontal database partitioning method is used, where each processor stores equal number of customer transactions. dSpade is a synchronous algorithm for discovering frequent 1-sequences F1 and frequent 2-sequences F2. Each processor performs the same computation on its local data to get local support counts and broadcasts the results to other processors to find global frequent sequences during F1 and F2 computation. After discovering all F1 and F2, all frequent sequences are inserted into lattice to decompose the original problem into equivalence classes. Equivalence classes are mapped in a greedy heuristic to least loaded processors in a round-robin manner. Finally, each processor asynchronously begins to compute Fk on its mapped equivalence classes to find all frequent sequences. We present results of performance experiments conducted on a 32-node Beowulf Cluster. Experiments show that dSpade delivers good speedup and scales linearly in the database size.

Keywords: Lattice, equivalence class,horizontal database partitioning method.