Department of Computer Engineering
S E M I N A R
Containment Detection in News Corpora
Computer Engineering Department
We study a generalization of the near-duplicate detection problem: determining whether one document is contained in (is a subset of) another. In text-based applications, document containment appears as exact duplicates, near-duplicates, or containments, where the first two are special cases of the third. We introduce a new method, called CoDet, designed specifically for this problem, and compare its performance with four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) adapted to containment detection. Our method is extensible to other domains and is especially suitable for streaming news. Experimental results show that CoDet outperforms all of the other methods, producing better results efficiently.
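To make the notion of containment concrete, the following is a minimal sketch of a shingle-based containment score, in the spirit of the resemblance/containment measures used by DSC. It is not CoDet, whose details are not given in this abstract. The function names, the word-level tokenization, the shingle width `w=3`, and the sample sentences are all assumptions chosen for illustration.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) of text."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def containment(a, b, w=3):
    """Fraction of a's shingles that also occur in b (asymmetric, in [0, 1])."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)


# Hypothetical example: a short news snippet and a longer story that includes it.
short_doc = "the council approved the budget"
long_doc = "late on monday the council approved the budget after a long debate"

score_short_in_long = containment(short_doc, long_doc)  # every shingle of short_doc is in long_doc
score_long_in_short = containment(long_doc, short_doc)  # only a fraction of long_doc is in short_doc
```

The score is asymmetric: the short document is fully contained in the long one, but not the reverse. This is what separates containment from symmetric near-duplicate similarity, and exact duplicates and near-duplicates correspond to the case where both directions score close to 1.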
DATE: 25 April, 2011, Monday @ 16:15