Bilkent University
Department of Computer Engineering


Sentence Based Topic Modeling


Can Taylan Sarı
MSc Student
Computer Engineering Department
Bilkent University

The rapid growth of large digital text collections makes it necessary to extract short descriptions of those texts automatically. Many studies have addressed the detection of hidden topics in text corpora, but almost all models assume that each text is a bag of words. This study presents a new unsupervised topic model that pays more attention to text structure. The texts in a corpus are described by a generative graphical model in which each sentence is generated by a topic and the topics of consecutive sentences follow a hidden Markov chain. In contrast to the bag-of-words paradigm, the model assumes that semantic terms in the same sentence are related to the same topic, and that the topics of successive sentences build on a memory that changes slowly and meaningfully as the text flows.
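The generative story described above can be illustrated with a minimal sketch. This is not the model from the study itself: the function name `sample_document`, the parameters `initial`, `transition`, and `topic_word`, and the fixed sentence length are all assumptions made for illustration only. The sketch draws one topic per sentence from a Markov chain and then draws every word of that sentence from the same topic.

```python
import random


def sample_document(initial, transition, topic_word,
                    n_sentences, words_per_sentence, seed=0):
    """Sample a toy document: one topic per sentence, topics form a Markov chain.

    initial     -- list of probabilities for the first sentence's topic (assumed)
    transition  -- transition[i][j] = P(next topic = j | current topic = i) (assumed)
    topic_word  -- list of dicts, topic_word[k][word] = P(word | topic k) (assumed)
    """
    rng = random.Random(seed)
    topics = list(range(len(initial)))
    document = []

    # Topic of the first sentence comes from the initial distribution.
    topic = rng.choices(topics, weights=initial)[0]
    for _ in range(n_sentences):
        vocab = list(topic_word[topic].keys())
        weights = list(topic_word[topic].values())
        # All words in a sentence are drawn from the same topic.
        sentence = rng.choices(vocab, weights=weights, k=words_per_sentence)
        document.append((topic, sentence))
        # The next sentence's topic depends only on the current one.
        topic = rng.choices(topics, weights=transition[topic])[0]
    return document


if __name__ == "__main__":
    toy_topics = [
        {"goal": 0.5, "match": 0.3, "team": 0.2},
        {"vote": 0.5, "party": 0.3, "law": 0.2},
    ]
    # A large self-transition probability makes topics change slowly.
    sticky = [[0.9, 0.1], [0.1, 0.9]]
    for topic, words in sample_document([0.5, 0.5], sticky, toy_topics, 5, 4):
        print(topic, " ".join(words))
```

Giving the transition matrix a heavy diagonal is one simple way to mimic topics that change slowly as the text flows; a bag-of-words model would instead draw a topic for every word independently of its sentence.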

The results are evaluated both qualitatively, by examining topic keywords from particular text collections, and quantitatively, by means of perplexity, a measure of how well the model generalizes.
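For readers unfamiliar with perplexity, the following sketch shows the standard definition: the exponential of the negative average log-likelihood per token on held-out text, where lower values indicate better generalization. The function name `perplexity` and its inputs are assumptions for illustration; how the per-document log-likelihoods are computed under the model in this study is not shown here.

```python
import math


def perplexity(log_likelihoods, token_counts):
    """Corpus perplexity from per-document natural-log likelihoods.

    log_likelihoods -- log p(document) for each held-out document (assumed given)
    token_counts    -- number of tokens in each of those documents
    """
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        raise ValueError("perplexity is undefined for an empty corpus")
    # exp of the negative average log-likelihood per token
    return math.exp(-sum(log_likelihoods) / total_tokens)
```

As a sanity check, a model that assigns every token a probability of 1/4 has a perplexity of 4, which is the sense in which perplexity measures the effective number of equally likely word choices.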


DATE: 16 January, 2013, Thursday @ 14:00