Bilkent University
Department of Computer Engineering


Aggregate Profile Clustering via Distributed Stream Processing


Mehmet Ali Abbasoglu
MSc Student
Computer Engineering Department
Bilkent University

We present an approach for clustering profiles that are incrementally maintained over a stream of updates. The goal is to maintain profile clusters for business intelligence, so that customers with similar behavior are grouped together. Such clusters can be used to build group models that are broader than individual customer models and that serve various data analysis purposes. Our approach operates online, so that it can adapt to changes in data characteristics and resource availability. Maintaining the clusters on a single machine may not be feasible, especially if the profiles are large or if the cost of processing each profile update is high. To address this, our approach scales out to handle 'big data' by applying partitioned stateful parallelism on top of a distributed stream processing middleware. In particular, we partition the incoming stream over a set of processing hosts, using the customer id as the partitioning key, and have each host process its own sub-stream while maintaining a subset of the clusters and the associated state. We use adaptive partitioning techniques to balance the memory and communication costs across hosts, while maintaining high-fidelity clusters and minimizing migration overheads.
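
The key-based partitioning described above can be illustrated with a minimal Python sketch. It is not the implementation presented in the talk: it assumes a fixed hash of the customer id in place of the adaptive partitioning, keeps only a running per-customer aggregate as the "profile", and omits clustering and state migration. The names host_for, Host, and route, and the (customer_id, feature, value) update format, are hypothetical.

```python
import hashlib
from collections import defaultdict


def host_for(customer_id: str, num_hosts: int) -> int:
    """Map a customer id to a host index with a stable hash.

    Assumption: a static hash partitioner. The talk describes adaptive
    partitioning, which this sketch does not attempt to reproduce.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_hosts


class Host:
    """Hypothetical processing host holding the state for its sub-stream."""

    def __init__(self) -> None:
        # customer id -> feature name -> running aggregate
        self.profiles = defaultdict(lambda: defaultdict(float))

    def process(self, customer_id: str, feature: str, value: float) -> None:
        # Incrementally maintain the aggregate profile of one customer.
        self.profiles[customer_id][feature] += value


def route(updates, hosts) -> None:
    """Send each update to the single host that owns its customer id."""
    for customer_id, feature, value in updates:
        owner = hosts[host_for(customer_id, len(hosts))]
        owner.process(customer_id, feature, value)
```

Because every update for a given customer id reaches the same host, each host can keep that customer's profile locally, without coordinating with the other hosts on every update.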


DATE: 19 November, 2012, Monday @ 15:30