Bilkent University
Department of Computer Engineering


Cluster Labeling Improvement by Utilizing Data Fusion and Wikipedia


Gökçe Ayduğan
MS Student
(Supervisor: Prof. Dr. Fazlı Can)
Computer Engineering Department
Bilkent University

A cluster is a set of related documents, and cluster labeling is the process of assigning descriptive labels to clusters. This study investigates several cluster labeling approaches and presents two novel methods. In the first approach, we work with the clusters themselves: we use several statistical feature selection methods to extract important terms that distinguish clusters from one another, and then apply different data fusion approaches to the outputs of these methods. Our results show that, although the fusion approach provides statistically significantly better results in some cases, it is not a stable and reliable labeling approach. One explanation is that a good label may not occur in the cluster at all. In the second approach, we exploit Wikipedia as an external resource and use its anchor texts and categories to enrich the label pool. We observe that anchor texts fail as labels because they tend to focus on minor topics: although the suggested anchor texts are related to the main topic, they do not exactly describe it. Following this observation, we use the categories of Wikipedia pages to enrich the label pool. The novelty of this approach is that we retrieve Wikipedia pages according to their relatedness to the clusters and use the categories of the retrieved pages. The experimental results show that this approach provides statistically significantly better results than the other cluster labeling approaches examined in this study.
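
The first approach, fusing the ranked term lists produced by several feature selection methods, can be illustrated with a minimal sketch. The abstract does not name the feature selection or fusion methods used in the study, so everything specific here is an assumption: the Borda count is chosen only as one well-known rank fusion scheme, and the function name borda_fuse, the three list names, and the example terms are hypothetical.

```python
from collections import defaultdict


def borda_fuse(rankings, top_k=3):
    """Fuse several ranked term lists with a Borda count (illustrative choice).

    Each ranking is a list of candidate label terms, best first. A term in
    position i of a list of length n earns n - i points from that list;
    terms missing from a list earn nothing from it.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, term in enumerate(ranking):
            scores[term] += n - position
    # Highest total score first; ties are broken alphabetically so the
    # result is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in fused[:top_k]]


# Hypothetical ranked outputs of three feature selection methods for one
# cluster. These are made-up terms, not results from the study.
ranking_a = ["football", "league", "goal", "season"]
ranking_b = ["league", "football", "transfer", "goal"]
ranking_c = ["football", "goal", "league", "coach"]

labels = borda_fuse([ranking_a, ranking_b, ranking_c])
print(labels)
```

Note that every candidate in this sketch comes from the cluster's own vocabulary, which is the limitation the abstract points out: if a good label never occurs in the cluster, no fusion of these lists can produce it.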

Keywords: Cluster Labeling, Data Fusion, Wikipedia


DATE: 27 July 2017, Thursday @ 15:00