Bilkent University
Department of Computer Engineering


A Cluster-Based External Plagiarism and Parallel Corpora Detection Method


Ceyhun Karbeyaz
MSc Student
Computer Engineering Department
Bilkent University

Today, different translations of the same literary text can be found. Intuitively, translations based on the same literary text are expected to have significantly similar structure. In the same way, a text suspected of plagiarism may share structural similarities with the text believed to be its source. Textual plagiarism is the use of an author's text, work, or ideas in another textual work without citing the source or obtaining the permission of the original author. Existing intrinsic and external automatic plagiarism detection methods tend to search for plagiarism cases within a given dataset so that they can run in a reasonable amount of time. Hence, a reference document set is built for these algorithms to search against.

In this thesis, a method for detecting and quantifying external plagiarism and parallel corpora is introduced. We use structural similarities to analyze the plagiarism detection problem and to quantify the similarity between structurally similar texts. In this method, the suspicious and source texts are partitioned into corresponding blocks. Each block is represented as a group of documents, where a document consists of a fixed number of words. The blocks are then indexed and clustered using the cover coefficient clustering algorithm. The cluster formations for both texts are then analyzed and their similarities are measured. Results on the PAN'09 plagiarism dataset and on different versions of the classic literary text Fuzuli's Leyla and Mecnun show that the proposed method successfully detects and quantifies structurally similar plagiarism cases and succeeds in detecting parallel corpora.
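
The partitioning step described above can be illustrated with a minimal Python sketch. It covers only the split of a text into blocks and of each block into fixed-size word documents; it does not implement the cover coefficient clustering or the similarity measurement. The function names, whitespace tokenization, and the choice of equal-length blocks are assumptions made for illustration and are not taken from the thesis.

```python
from typing import List


def split_into_documents(words: List[str], doc_size: int) -> List[List[str]]:
    """Split a list of words into consecutive documents of doc_size words.

    The last document may be shorter if the words do not divide evenly.
    """
    return [words[i:i + doc_size] for i in range(0, len(words), doc_size)]


def partition_text(text: str, num_blocks: int, doc_size: int) -> List[List[List[str]]]:
    """Partition a text into blocks, each block being a list of documents.

    Assumption: blocks are made roughly equal in word count; the abstract
    does not say how corresponding blocks are chosen.
    """
    words = text.split()  # assumption: simple whitespace tokenization
    # Ceiling division so that every word falls into some block.
    block_len = max(1, -(-len(words) // num_blocks))
    blocks = []
    for start in range(0, len(words), block_len):
        block_words = words[start:start + block_len]
        blocks.append(split_into_documents(block_words, doc_size))
    return blocks
```

Applying the same `num_blocks` and `doc_size` to both the suspicious and the source text would yield block pairs that could then be indexed and clustered separately, under the assumptions stated above.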


DATE: 21 July, 2011, Thursday @ 9:30