Department of Computer Engineering
S E M I N A R
Combinatorial models and algorithms for search engine design
Berkant Barla Cambazoglu
In the last decade, search engines became an integral part of our lives. The current state-of-the art in search engine technology rely on parallel text retrieval. Basically, a search engine is composed of three components: a crawler, an indexer, and a query processor. The crawler component aims to locate, fetch, and store the Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this talk, we will present models and algorithms for efficient crawling, indexing, and querying of text repositories. For efficient parallel and distributed Web crawling, we propose models that aim to minimize the communication overhead between the processors while balancing the download times and storage loads of processors. We also propose models for document- and term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. We further develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. Currently, we are working on multi-level document id reordering models for efficient inverted index compression. We test the proposed models and algorithms over our experimental search engine, Skynet. This search engine is implemented in C and is currently running on a 48-node PC cluster. In the talk, we will also discuss the design and implementation details of our grid-enabled search engine, SE4SEE.
DATE: October10, 2005, Monday@ 16:40
PLACE: EA 409