Bilkent University
Department of Computer Engineering


Local Context Based Linear Text Segmentation


Hayrettin Erdem
MSc Student
Computer Engineering Department
Bilkent University

Understanding the topical structure of text documents is important for effective retrieval and browsing, automatic summarization, and tasks related to identifying, clustering and tracking documents by topic. Although documents often display structural organization and contain explicit section markers, some lack such properties, which reveals the need for topical text segmentation systems. Examples of such documents are speech transcripts and inherently unstructured texts, such as newspaper columns and blog entries, that discuss several subjects in a single discourse. A novel local-context based approach relying on lexical cohesion is presented for linear text segmentation, which is the task of dividing text into a linear sequence of coherent segments. As the lexical cohesion indicator, the proposed technique exploits relationships among terms induced from a semantic space called HAL (Hyperspace Analogue to Language), which is built by examining the co-occurrence of terms while passing a fixed-size window over the text. The algorithm iteratively discovers topical shifts by examining the most relevant sentence pairs in the block of sentences considered at each iteration. The proposed technique is evaluated both on error-free speech transcripts of news broadcasts and on documents formed by concatenating different topical regions of text. A new corpus for Turkish is built automatically, where each document is formed by concatenating different news articles. For performance comparison, two state-of-the-art methods, TextTiling and C99, are used as baselines, and the results show that the proposed approach achieves performance comparable to these two techniques.
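
To make the HAL construction mentioned above more concrete, the following is a minimal sketch of how a co-occurrence space might be built with a fixed-size sliding window. It is not the implementation from the thesis: the function names `build_hal` and `relatedness`, the window size, and the distance-based weighting (closer terms contribute more, as in the commonly described HAL formulation) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_hal(tokens, window=5):
    """Hypothetical helper: hal[target][preceding_term] = accumulated weight.

    Assumption: a term at distance d before the target contributes
    window - d + 1, so nearer terms weigh more.
    """
    hal = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break  # ran off the start of the text
            hal[target][tokens[j]] += window - d + 1
    return hal

def relatedness(hal, a, b):
    """Hypothetical symmetric score: co-occurrence weight in both directions."""
    return hal.get(a, {}).get(b, 0) + hal.get(b, {}).get(a, 0)

tokens = "the cat sat on the mat".split()
hal = build_hal(tokens, window=2)
print(relatedness(hal, "cat", "sat"))
```

A segmenter built on such a space could score sentence pairs by aggregating term-level relatedness, but how the thesis combines these scores is not specified in the abstract.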

Keywords: Text Segmentation, Topic Segmentation, Lexical Cohesion, Semantic Relatedness


DATE: 21 February, 2014, Friday @ 15:40