Efficient Parallel Crawling of Web Content

Abstract: This project aims to collect all the pages on the Web and to keep pace with the rapid growth of Web content by implementing an efficient Web crawler. A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine or a Web cache. A crawler often has to download hundreds of millions of pages in a short period of time and must continuously monitor and refresh the downloaded pages in order to provide a fresh view of the Web. Because the Web is gigantic and continuously updated, a single-process crawler cannot achieve the required download rate. Many existing search engines therefore already use multiple processors in parallel to solve the Web crawling problem. Because they are cost-effective, PC clusters are widely used today and form a practical solution for the Web crawling problem. Little scientific research has been conducted on parallelization of the crawling and indexing process. This project involves the development and implementation of new, efficient parallel algorithms for the Web crawling and indexing problem, in order to gather fast-growing Web information efficiently and accurately and to provide the required refresh frequency. In addition to providing scientific information about the structure of the World Wide Web, this project will enable the gathering and indexing of Turkish Web pages. Such information is very valuable, as it can enable efficient and accurate searching within Turkish Web pages. Furthermore, analysis and data mining of the gathered data can reveal important sociological and statistical information about Turkey and the Turkish nation.

Keywords: Web Crawler, World Wide Web, Search Engine, PC Cluster

Principal Investigator: Cevdet Aykanat, Ph.D.
Investigator: Berkant Barla Cambazoğlu, MSc.
Investigator: Ata Türk, BSc.
Investigator: Eray Özkural, BSc.

Duration: April 2004 - March 2006.

Sponsor: Scientific and Technical Research Council of Turkey (TUBITAK)

Grant No: 103E028

Budget: 58,000 YTL in April 2004