Department of Computer Engineering
S E M I N A R
Exploiting Inter-Class Rules for Focused Crawling
A focused crawler is an agent that concentrates on a particular target topic and tries to visit and gather only the relevant pages from a narrow segment of the Web. A crucial issue for a focused crawler is the heuristic used to decide which page to visit next. Here, we propose a rule-based approach to improve the harvest rate and coverage of a baseline focused crawler. The baseline focused crawler (proposed by Chakrabarti et al.) employs a canonical topic taxonomy to train a naive Bayes classifier, which is in turn used to score unseen URLs. Our research explores using simple rules derived from inter-class (topic) linkage patterns when deciding on the crawler's next move. The rule-based approach also enhances the baseline crawler by supporting tunneling, that is, following links through off-topic pages to reach relevant ones. The initial performance results are encouraging for our rule-based focused crawling method compared to the baseline crawler.
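To make the idea of scoring URLs with inter-class rules more concrete, here is a minimal Python sketch. It is not the implementation presented in the talk: the rule table `RULES`, its class names and probabilities, the `rule_score` function, the `depth` limit, and the `Frontier` class are all hypothetical, chosen only for illustration. It assumes each rule gives the probability that a page of one class links to a page of another class, and that a classifier has already assigned a class to the page where the URL was found.

```python
import heapq

# Hypothetical rule table (illustrative values only): RULES[src][dst] is an
# assumed probability that a page of class `src` links to a page of class `dst`.
RULES = {
    "cycling": {"cycling": 0.6, "sports": 0.3},
    "sports": {"cycling": 0.2, "sports": 0.7},
    "news": {"sports": 0.5, "news": 0.5},
}


def rule_score(page_class, target, depth=2):
    """Best rule-chain probability of reaching `target` within `depth` links."""
    if depth == 0:
        return 0.0
    outgoing = RULES.get(page_class, {})
    # Direct rule: page_class -> target.
    best = outgoing.get(target, 0.0)
    # Chained rules: page_class -> mid -> ... -> target (a form of tunneling).
    for mid, prob in outgoing.items():
        if mid != target:
            best = max(best, prob * rule_score(mid, target, depth - 1))
    return best


class Frontier:
    """Priority queue of unvisited URLs, highest score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


if __name__ == "__main__":
    frontier = Frontier()
    # URLs found on a "news" page and on a "cycling" page; target is "cycling".
    frontier.push("http://example.org/a", rule_score("news", "cycling"))
    frontier.push("http://example.org/b", rule_score("cycling", "cycling"))
    print(frontier.pop())
```

In this sketch a URL found on a "news" page still receives a nonzero score for the target "cycling", because the chain news -> sports -> cycling exists in the assumed rule table, even though no direct rule links "news" to "cycling". A crawler that only scored links from on-topic pages would give such a URL no priority.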
DATE: November 8, 2004, Monday @ 15:40