Project topics for CS490 (Spring 2009)
1) Query processing
performance of static-pruned index files in a compressed environment
Description: Search
engines essentially use an inverted index structure to obtain the query results
in a timely manner. Still, for very large collections like the Web; it is
impossible to use the entire inverted index during query processing, since some
query terms may occur in millions or even billions of documents. To this end,
there exist several static pruning techniques to reduce the inverted index
size. Once the index is pruned, it’s size is further compressed by further
employing compression techniques.
In the literature, pruned indexes are usually
compared against a full-index and shown to provide time gains, due to smaller
index size, which means less disk access time, less cache space and quicker
evaluation. They are also compared in terms of effectiveness. However, there is
no work that compares the time performance of different pruning strategies
(except [1] for a rather limited case). That is, each strategy prunes different
posting lists with different amounts. It is clear that each will have better
query processing times in comparison to a full index. However, it is unclear
how do these techniques compare to each other. In this project, we will
investigate each component that contributes to query processing time (e.g.,
disk read, cache miss ratio, CPU execution) and compare how different
techniques vary at the same pruning levels. In particular, the project includes
the following stages:
[1]
Blanco, R. Index
Compression for Information Retrieval Systems. Doctoral Thesis, University of A
Coruña, 2008.
[2]
AltIngovde, I. S.,
Ozcan, R., Ulusoy, Ö., A Practitioner’s Guide for Static Index Pruning, In Proc. of
31st European Conference on Information Retrieval (ECIR'09), to appear,
2009