Department of Computer Engineering
S E M I N A R
Moving Beyond Term-Frequency Models for Text Categorization
Text categorization is a major application in machine learning research due to the increasing demand for autonomous organization of digital documents on the Internet. Integrating semantic information into text classification is highly desirable, but current term-frequency models do not seem adequate for this. We explore recent theoretical developments toward more detailed models of text dissimilarity, in particular the domain-independent metrics from algorithmic information theory (Kolmogorov complexity). We outline syntactic kernels for text dissimilarity. One of our kernels is scalable in the number of documents, and we have shown it to be superior to a monogram term-frequency model on Turkish documents from a fuzzy and imbalanced dataset consisting of articles from the newspaper Radikal. We discuss the advantages and limitations of our generalized syntactic approach. The algorithmic distance functions are fairly independent of the language, can detect sentence-level similarity, and look particularly well suited to the analysis of agglutinative languages such as Turkish and Finnish. On the other hand, our current methods are purely syntactic, which places inherent limitations on the precision of classification. We can move even further from the stemming approach by using word-sense disambiguation tools and commonsense knowledge bases such as WordNet and ConceptNet. Filtering text through a word-sense disambiguator yields sense strings, which can be processed with classical term-frequency models, with well-known string kernels used in bioinformatics and NLP research, or with our advanced kernels. A commonsense knowledge base can help us reason about dissimilarity in semantic rather than syntactic terms. Finally, we mention the relevance of kernel feature selection and multilevel statistics for large-scale text categorization.
In particular, we summarize the approaches used in the open-source semantic search engine ALVIS, and argue that semantic search will eventually lead us to abandon the well-studied term-frequency models.
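To make the idea of a Kolmogorov-complexity-based metric concrete, the following is a minimal sketch of the normalized compression distance, a commonly used computable stand-in for such metrics in which a real compressor approximates the uncomputable complexity. It is an illustration only and is not the kernel presented in the seminar; the choice of zlib as the compressor and the names `compressed_size` and `ncd` are assumptions made for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data; a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed length. Smaller values suggest more shared structure.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation of both texts
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical toy documents, not taken from the dataset mentioned above.
    doc_a = "text categorization with term frequency models " * 10
    doc_b = "Istanbul weather forecast predicts rain and strong winds tomorrow evening"
    print("distance to itself:      ", ncd(doc_a, doc_a))
    print("distance to another text:", ncd(doc_a, doc_b))
```

Because the distance operates on raw byte strings, it needs no tokenizer or stemmer, which is one reason such measures are considered largely language independent. With a general-purpose compressor the values are only rough approximations, especially for very short texts.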
DATE: November 22, 2004, Monday @ 15:40