Bilkent University
Department of Computer Engineering


CoDet: A New Algorithm for Containment and Near Duplicate Document Detection in Text Corpora


Emre Varol
MSc Student
Computer Engineering Department
Bilkent University

In this thesis, we investigate containment detection, a generalization of the well-known near-duplicate detection problem that asks whether a document is a subset of another document. In text-based applications, document containment can be observed in three ways: exact duplicates, near-duplicates, and containments, where the first two are special cases of containment. To detect containments, we introduce CoDet, a novel algorithm that focuses specifically on the containment problem. We also compare its performance with four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) that are adapted to containment detection. Our algorithm is especially suitable for streaming news, and it is also extensible to other domains. Experimental results show that CoDet outperforms the other algorithms in most cases and is effective at detecting containments in text corpora.

Keywords: Corpus Tree, Document Containment, Near-Duplicate Detection, Similarity, Test Collection Preparation, Algorithm


DATE: 25 January, 2012, Wednesday @ 16:15