SEMINAR

DEPARTMENT OF COMPUTER ENGINEERING

ABSTRACT

Categorization in a Hierarchically Structured Text Database

Ferhat Kutlu

M.S. in Computer Engineering

Over the past two decades, the amount of data stored in databases and the flow of data on-line have increased enormously, driven by improvements in the Internet. This growth created a need for intelligent tools to manage data of that size and its flow. A hierarchical approach is the best way to satisfy this need, and it is widespread among people who work with databases and the Internet. The Usenet newsgroup system is one of the on-line databases that have a built-in hierarchical structure. Our point of departure is this hierarchical structure, which makes categorization tasks easier and faster. In fact, most Internet search engines also exploit the inherent hierarchy of the Internet. The growing size of data makes most traditional categorization algorithms obsolete. We therefore developed a new categorization learning algorithm that constructs an index tree from a Usenet news database and then determines the relevant newsgroups for a new news article by categorizing it over the index tree. In the learning phase, it takes an agglomerative, bottom-up hierarchical approach. In the categorization phase, it performs overlapping, supervised categorization. The k Nearest Neighbor categorization algorithm is used as a baseline for comparing the complexity and accuracy of our algorithm. This comparison is not only between two different algorithms; it also compares the hierarchical approach with the flat approach, a similarity measure with a distance measure, and the importance of accuracy with the importance of speed. Our algorithm uses the hierarchical approach and a similarity measure, and it greatly outperforms the k Nearest Neighbor categorization algorithm in speed at the cost of a small loss in accuracy.

The seminar will be held on Monday, February 19, at 15:30

in EA-502