Bilkent University
Department of Computer Engineering


Page-to-Processor Assignment Techniques for Parallel Crawlers


Ata Turk
Ph.D Student
Computer Engineering
Bilkent University

In less than a decade, theWorld WideWeb has evolved from a research project to a cultural phenomena effective in almost every facet of our society. The increase in the popularity and usage of the Web enforced an increase in the effciency of information retrieval techniques used over the net. Crawling is among such techniques and is used by search engines, web portals, and web caches. A crawler is a program which downloads and stores web pages, generally to feed a search engine or a web repository. In order to be of use for its target applications, a crawler must download huge amounts of data in a reasonable amount of time. Generally, the high download rates required for effcient crawling cannot be achieved by single-processor systems. Thus, existing large-scale applications use multiple parallel processors to solve the crawling problem. Apart from the classical parallelization issues such as load balancing and minimization of the communication overhead, parallel crawling poses problems such as overlap avoidance and early retrieval of high quality pages. This presentation will address parallelization of the crawling task, and its discussion is mainly on partitioning/page-to-processor assignment techniques applied in parallel crawlers. Two new page-to-processor assignment techniques that we propose based on graph and hypergraph partitioning will be presented, which respectively minimize the total communication volume and the number of messages, while balancing the storage load and page download requests of processors.


DATE: April 17, 2006, Monday@ 16:40