Bilkent University
Department of Computer Engineering


Sentence Based Topic Modeling


Can Taylan Sarı
MSc Student
Computer Engineering Department
Bilkent University

A vast amount of plain text has accumulated in digital media since the first article was typed on a computer. Classifying raw text, extracting information from it, and finding short descriptions and topics automatically, rather than retrieving information manually, are seen as vital needs in many fields, including cognitive science, natural language processing, and commercial applications.

Topic models (TM) are a family of probabilistic models that discover the hidden topics occurring in documents from the frequencies of the words that appear in them. Several algorithms extract topics from text documents: Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing/Analysis (PLSI/PLSA), Latent Dirichlet Allocation (LDA), and more. LDA, the most popular of these, is a Bayesian graphical model that represents each document as a mixture of topics, where each topic is a distribution over words. It differs from the earlier methods in its use of Dirichlet priors on the topic distributions.

All of the methods mentioned (LSI, PLSA, LDA) are tightly coupled to the bag-of-words paradigm, which discards the semantic structure and the ordering of words and sentences in the document. Treating documents as unstructured sets of words cannot capture the fact that words in the same sentence, and sentences in the same paragraph, are semantically close to each other.
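The loss of ordering under bag-of-words can be seen in a small sketch. The two toy documents and the helper name bag_of_words below are illustrative assumptions, not material from the talk; the point is only that texts with different sentence structure can collapse to the same word counts.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

# Two hypothetical documents built from the same tokens in a different order.
doc_a = "the bank raised rates . the river flooded the bank"
doc_b = "the river raised the bank . the bank flooded rates"

# The texts differ, yet a bag-of-words model receives identical input for both.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_bag)
```

Any model that sees only these counts cannot tell which words shared a sentence, which is the limitation described above.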

Words build up sentences, sentences build up paragraphs, paragraphs build up documents, and so on. Sentences are natural boundaries between words on different topics, just as paragraphs are for sentences. Under this view, neighboring sentences should have identical or similar topics, and this similarity should weaken as the distance between them increases.

In our model, each sentence is treated as a set of words that all share the same latent topic variable, and the topics of consecutive sentences are modeled as a Hidden Markov Model (HMM). A sentence can therefore capture the strong statistical dependence among its own words, and the HMM can describe how the shared topics of these word sets evolve through the document. In the end, we expect much better representations of the gists of documents.
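One way to read the model described above is as a generative process: a Markov chain picks one topic per sentence, and every word in that sentence is drawn from that topic's word distribution. The following is a minimal sketch under that reading, not the speaker's implementation. The function name sample_document, the two toy topics, and all probability values are hypothetical, and inference (recovering topics from text) is not shown.

```python
import random

def sample_document(n_sentences, words_per_sentence,
                    initial, transition, topic_words, seed=0):
    """Sample (topic, words) pairs: one latent topic per sentence,
    with sentence topics following a first-order Markov chain."""
    rng = random.Random(seed)
    topics = list(range(len(initial)))
    document = []
    # Topic of the first sentence comes from the initial distribution.
    z = rng.choices(topics, weights=initial)[0]
    for _ in range(n_sentences):
        vocab = list(topic_words[z].keys())
        probs = list(topic_words[z].values())
        # Every word in the sentence shares the same topic z.
        words = rng.choices(vocab, weights=probs, k=words_per_sentence)
        document.append((z, words))
        # Topic of the next sentence depends only on the current one.
        z = rng.choices(topics, weights=transition[z])[0]
    return document

# Hypothetical toy parameters: two topics with "sticky" transitions,
# so neighboring sentences tend to stay on the same topic.
initial = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.1, 0.9]]
topic_words = [
    {"goal": 0.5, "match": 0.3, "team": 0.2},
    {"vote": 0.5, "party": 0.3, "law": 0.2},
]

document = sample_document(6, 4, initial, transition, topic_words)
for topic, words in document:
    print(topic, " ".join(words))
```

With large self-transition probabilities the chain changes topic rarely, which mirrors the assumption that topic similarity between sentences weakens with distance.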


DATE: 11 March, 2013, Monday @ 16:10