Bilkent University
Department of Computer Engineering
S E M I N A R

 

Incremental Clustering of Heterogeneous Datasets Using Robust Gaussian Models

 

Caner Mercan
MSc Student
Computer Engineering Department
Bilkent University

Cluster analysis comprises the methods and algorithms for grouping, or clustering, objects that carry no category labels, that is, no identifiers such as class labels. Clustering methods can be broadly divided into several categories, including hierarchical, graph theory-based, fuzzy, and probability model-based approaches. In the probability model-based approach, the data are assumed to follow a mixture of probability distributions, which allows mixture likelihood approaches to be used for clustering. The classical maximum likelihood estimator (MLE) is used to estimate the parameters of the mixture via the expectation-maximization (EM) algorithm. However, when the dataset is heterogeneous, containing both inliers and outliers, the classical approach cannot model only the inliers while leaving out the outliers. To overcome this, robust variations of the MLE have been proposed, but these approaches depend on knowing which points are inliers and which are not. Separately, greedy learning schemes have been incorporated into learning Gaussian Mixture Models (GMMs), a specific type of mixture model, and these schemes do not require the optimal number of components to be known. These works, however, do not take robustness against outliers into account and therefore fail badly on heterogeneous datasets. We present an incremental clustering approach using robust Gaussian models in which the number of components, which corresponds to the number of distinct classes in the dataset, need not be known, and in which model estimation depends only on the approximate number of inliers, not on knowledge of which samples are inliers.

 

DATE: 01 April, 2013, Monday @ 16:10
PLACE: EA409