Bilkent University
Department of Computer Engineering


Linear Text Segmentation by Topic


Hayrettin Erdem
MSc Student
Computer Engineering Department
Bilkent University

Recently there has been growing interest in tracking topics in lengthy, uninterrupted texts such as transcripts of news broadcasts and business meetings. Such texts tend to be terse, with little word repetition, and carry no indication of topic boundaries. The aim is to discover topic boundaries linearly and obtain topically coherent segments. The task is also useful for passage retrieval and text summarization. We introduce a new methodology that takes an existing study as its starting point. The data set consists of transcripts of TRT news broadcasts. We apply traditional preprocessing steps to a candidate document. We then construct an intrasimilarity matrix that holds the correlation of each pair of words appearing in the text. Next, we create dotplots to visualize the data in the matrix and to find discourse boundaries. We visually compared the dotplot of our system with that of the baseline system. This visual comparison suggests that our system may give more effective results than the baseline system.
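
To make the matrix and dotplot steps concrete, the following is a minimal sketch and not the method presented in the talk. It assumes the simplest possible word-pair measure (1.0 when two positions hold the same token, 0.0 otherwise), since the abstract does not specify the correlation measure. The function names, the toy token list, and the single-boundary density criterion are all illustrative assumptions.

```python
def similarity_matrix(tokens):
    """Build an n x n matrix over token positions.

    Assumption: similarity is exact token match (1.0 or 0.0). The correlation
    measure of the actual system is not given in the abstract.
    """
    n = len(tokens)
    return [[1.0 if tokens[i] == tokens[j] else 0.0 for j in range(n)]
            for i in range(n)]


def render_dotplot(matrix):
    """Render the matrix as text: '#' for a non-zero cell, '.' otherwise."""
    return "\n".join(
        "".join("#" if value > 0 else "." for value in row) for row in matrix
    )


def best_single_boundary(matrix):
    """Return the position b that minimizes cross-segment similarity density.

    Hypothetical criterion: for a boundary placed before position b, sum the
    similarities between positions [0, b) and [b, n), then normalize by the
    number of such pairs. Ties are resolved in favor of the earliest position.
    """
    n = len(matrix)
    best_b, best_density = None, float("inf")
    for b in range(1, n):
        cross = sum(matrix[i][j] for i in range(b) for j in range(b, n))
        density = cross / (b * (n - b))
        if density < best_density:
            best_b, best_density = b, density
    return best_b


# Toy input standing in for a preprocessed transcript (invented, not TRT data).
tokens = "election vote party election vote goal match team goal match".split()
matrix = similarity_matrix(tokens)
print(render_dotplot(matrix))
print(best_single_boundary(matrix))
```

On this toy input the two halves share no words, so the selected boundary falls at position 5, between the two word groups. Real transcripts with low word repetition would need a softer similarity measure than exact matching, which is the gap a word-pair correlation matrix is meant to fill.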

Keywords: Text segmentation, topic segmentation, topic boundary, dotplot


DATE: 07 May, 2012, Monday @ 15:40