CS490 Project Topics for Fall 2008

Project topics for CS490 (Spring 2009)

1) Query processing performance of static-pruned index files in a compressed environment

Description: Search engines essentially use an inverted index structure to obtain the query results in a timely manner. Still, for very large collections like the Web; it is impossible to use the entire inverted index during query processing, since some query terms may occur in millions or even billions of documents. To this end, there exist several static pruning techniques to reduce the inverted index size. Once the index is pruned, it’s size is further compressed by further employing compression techniques.

In the literature, pruned indexes are usually compared against a full-index and shown to provide time gains, due to smaller index size, which means less disk access time, less cache space and quicker evaluation. They are also compared in terms of effectiveness. However, there is no work that compares the time performance of different pruning strategies (except [1] for a rather limited case). That is, each strategy prunes different posting lists with different amounts. It is clear that each will have better query processing times in comparison to a full index. However, it is unclear how do these techniques compare to each other. In this project, we will investigate each component that contributes to query processing time (e.g., disk read, cache miss ratio, CPU execution) and compare how different techniques vary at the same pruning levels. In particular, the project includes the following stages:

Literature study: The student is expected to learn basics about inverted index pruning [1, 2], compression and query processing.

Implementation and research tasks: At the moment, our research group implemented several pruning techniques and created pruned files for various datasets. This project involves:

Compression: The pruned index files should be compressed using the gamma compression with our existing software. We also plan to implement variable byte coding [].

Query processing: In query processing codes, the following changes should be done: First, decompression routines should be adapted. Next, the several statistics should be held. Most importantly, we will simulate a cache and collect cache hit/miss ratios (cache access times would not be included in CPU time analysis). Different caching strategies (static/dynamic/hybrid) and eviction policies (LRU etc.) should be implemented. Finally, if time permits, additional pruning policies that are guided by some query specific techniques will be implemented.

Experimentation: The experiments would be realized using the DMOZ crawl data which is expected to be around 2 million pages.

[1] Blanco, R. Index Compression for Information Retrieval Systems. Doctoral Thesis, University of A Coruña, 2008.

[2] AltIngovde, I. S., Ozcan, R., Ulusoy, Ö., A Practitioner’s Guide for Static Index Pruning, In Proc. of 31st European Conference on Information Retrieval (ECIR'09), to appear, 2009