SEMINAR

DEPARTMENT OF COMPUTER ENGINEERING

ABSTRACT

APPLICATION OF FEATURE PROJECTION BASED TEXT CATEGORIZATION ALGORITHM ON THE TURKISH DATASET

Ufuk İlhan

M.S. in Computer Engineering

Supervisor: Assoc. Prof. Halil Altay Güvenir

February 21, 2001 at 14:40 in EB267

 

This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of text categorization. Nearly all researchers have been concerned with English or with languages morphologically similar to English. In such languages, words contain only a small number of affixes, or none at all; almost all parsing models for them therefore treat affix recognition as trivial and perform no morphological analysis. This property makes it easy to stem words to their roots. In agglutinative languages such as Turkish, on the other hand, words contain no direct indication of where the morpheme boundaries are, and morphemes take a shape that depends on the morphological and phonological context. In Turkish, the process of adding one suffix to another can produce relatively long words, which often carry as much semantic information as a whole English phrase, clause, or sentence. Because of this complex morphological structure, a single Turkish word can give rise to a very large number of variants. Turkish therefore requires text processing techniques different from those used for English and similar languages.
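To make the variant problem concrete, the following minimal Python sketch groups surface forms under a root using a trailing wildcard (the pattern `root*`). It is an illustration only, not the procedure used in the thesis; the function names, the example tokens, and the choice of roots are all assumptions made for this sketch.

```python
from fnmatch import fnmatchcase

def matches_root(token, root):
    """Return True if token matches the wildcard pattern root* (hypothetical helper)."""
    return fnmatchcase(token, root + "*")

def count_root_occurrences(tokens, roots):
    """Count, for each root, how many tokens match root* (hypothetical helper)."""
    counts = {root: 0 for root in roots}
    for token in tokens:
        for root in roots:
            if matches_root(token, root):
                counts[root] += 1
    return counts

# Illustrative tokens: ev (house), evler (houses), evlerimiz (our houses),
# evlerimizde (in our houses), kitap (book), kitaplar (books),
# kitabı (the book, accusative), okul (school).
tokens = ["evler", "evlerimiz", "evlerimizde",
          "kitap", "kitaplar", "kitabı", "okul"]
counts = count_root_occurrences(tokens, ["ev", "kitap"])
print(counts)  # {'ev': 3, 'kitap': 2}
```

Note that "kitabı" is not counted under "kitap": the final consonant of the root changes when the suffix is attached, so a plain prefix pattern misses it. This is one small instance of morphemes changing shape with phonological context.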

This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN, called the Feature Projection Text Categorization (FPTC) algorithm, which is based on the idea of representing training instances as their projections on each feature dimension.
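The contrast between the two approaches can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the FPTC algorithm as defined in the thesis: documents are assumed to be small term-frequency vectors, plain k-NN uses squared Euclidean distance, and the projection-based variant lets the k nearest training values on each feature dimension cast one vote each. The function names and the toy data are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Plain k-NN: majority label among the k nearest training vectors."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), label)
        for vec, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def fp_predict(train, query, k):
    """Projection-based voting (assumed scheme): on each feature dimension,
    the k training values closest to the query value vote for their labels;
    votes are summed over all features."""
    votes = Counter()
    for f in range(len(query)):
        projection = sorted(
            (abs(vec[f] - query[f]), label) for vec, label in train
        )
        for _, label in projection[:k]:
            votes[label] += 1
    return votes.most_common(1)[0][0]

# Toy term-frequency vectors over two hypothetical features.
train = [
    ([3, 0], "sports"), ([2, 1], "sports"), ([4, 0], "sports"),
    ([0, 3], "economy"), ([1, 2], "economy"), ([0, 4], "economy"),
]

print(knn_predict(train, [3, 1], k=3))  # sports
print(fp_predict(train, [3, 1], k=3))   # sports
print(fp_predict(train, [0, 3], k=3))   # economy
```

In this sketch each feature is handled independently, so the per-feature projections could be sorted once in advance and then searched at classification time, whereas plain k-NN computes a distance to every training vector for each query.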

Keywords: text categorization, classification, feature projections, stemming, wild card matching, stopword.