Bilkent University
Department of Computer Engineering


New Event Detection And Tracking in Turkish


Suleyman Kardas
Computer Engineering Department
Bilkent University

The amount of information and the number of information resources on the Internet have been growing rapidly for over a decade. This is also true for on-line news and news providers. To overcome information overload, news consumers prefer to track the topics that they are interested in. Topic detection and tracking (TDT) applications aim to organize the temporally ordered stories of a news stream according to the events they report. Two major problems in TDT are new event detection (NED) and topic tracking (TT). NED aims to find the first stories of previously unseen events, while TT aims to find all subsequent stories on a topic defined by a small number of initial stories. In this thesis, the NED and TT problems are investigated in detail using the first large-scale test collection, BilCol2005, developed by the Bilkent Information Retrieval Group. The collection contains 209,305 documents from the entire year of 2005 and covers several events, eighty of which are annotated by humans. The experimental results on BilCol2005 show that a simple word truncation stemming method can compete statistically with a sophisticated stemming approach that pays attention to the morphological structure of the language. Our statistical findings also illustrate that stopword removal and the contents of the associated stopword list are important, and that removing stopwords from the content can affect the performance of the system. Since the decision for each incoming story is made before the next story is processed, we introduce a threshold concept to the NED and TT systems. We demonstrate that the confidence scores of two different similarity measures can be combined in a straightforward manner to improve effectiveness.
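As a rough illustration of the threshold idea mentioned in the abstract, the following sketch flags a story as a new event when its best similarity to all earlier stories falls below a threshold. It is not the system evaluated in the thesis: the cosine measure over raw term counts, the prefix length of 5, the threshold of 0.3, and all function names are assumptions chosen only for this example.

```python
import math
from collections import Counter


def truncate_stem(word, prefix_len=5):
    """Word truncation stemming: keep only the first prefix_len characters.

    The prefix length of 5 is an assumed value for illustration.
    """
    return word[:prefix_len]


def vectorize(text, stopwords=frozenset(), prefix_len=5):
    """Turn a story into a bag of truncated stems, skipping stopwords.

    Note: str.lower() does not apply Turkish-specific casing rules
    (for example, "I" becomes "i"), so real Turkish text would need
    extra care. The sample input is already lowercase.
    """
    tokens = [w for w in text.lower().split() if w not in stopwords]
    return Counter(truncate_stem(w, prefix_len) for w in tokens)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_new_events(stories, threshold=0.3, stopwords=frozenset()):
    """Process stories in arrival order and decide on each one
    before looking at the next.

    A story is flagged as a new event (True) when its highest
    similarity to any earlier story is below the (assumed) threshold.
    """
    seen = []
    flags = []
    for text in stories:
        vec = vectorize(text, stopwords)
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags


if __name__ == "__main__":
    sample = [
        "deprem istanbul hasar",
        "istanbul deprem hasar büyük",
        "futbol maç sonuç",
    ]
    print(detect_new_events(sample))
```

In this sketch the second sample story shares most of its stems with the first, so it is treated as a continuation, while the first and third stories are flagged as new events. A lower threshold makes the detector more willing to flag stories as new only when they are very dissimilar from everything seen so far.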


DATE: Monday, 25 May 2009 @ 14:30