Bilkent University
Department of Computer Engineering


Near-Duplicate News Detection Using Named Entities


Erkan Uyar
Computer Engineering Department
Bilkent University

The number of web documents has been increasing in an exponential manner for more than a decade. In a similar way, partially or completely duplicate documents appear frequently on the Web. Advances in the Internet technologies have also increased the number of news agencies. People tend to read news from news portals that aggregate documents from different sources. The existence of duplicate or near-duplicate news in these portals is a common problem. Duplicate documents create redundancy and only a few users may want to read news containing identical information. Duplicate documents decrease the efficiency and effectiveness of search engines. In this thesis, we propose and evaluate a new near-duplicate news detection algorithm: Tweezer. In this algorithm, named entities and the words that appear before and after them are used to create document signatures. Documents sharing the same signatures are considered as near-duplicate. For named entity detection, we introduce a method called Turkish Named Entity Recognizer, TuNER. For evaluation of Tweezer, a document collection is created using news articles obtained from Bilkent News Portal. In the experiments, Tweezer is compared with I-Match, which is a state-of-the-art near-duplicate detection algorithm that creates document signatures using Inverse Document Frequency, IDF, values of terms. It is experimentally shown that the effectiveness of Tweezer is statistically significantly better than that of I-Match by using a cost function that combines false alarm and miss rate probabilities. Furthermore, Tweezer is at least 17% faster than I-Match.


DATE: 22 May, 2009, Friday@ 10:40