Caching and Index Pruning Techniques for Large Scale Web Search Engines


Rifat Ozcan
Ph.D. Student
Computer Engineering Department
Bilkent University

The availability of digital information on the Internet has given people an easy way to reach information. Web search engines (WSEs) play an important role in this task: they try to connect people who have an information need with the web pages that contain that information. As the web grows exponentially and the number of web users increases at a high rate, today’s WSEs need efficient mechanisms to scale while keeping response times acceptable. Caching and index pruning are two techniques for this purpose. Caching tries to avoid query processing by storing the results of the most frequently asked queries. Index pruning, on the other hand, tries to decrease the size of the inverted index by removing insignificant parts of the posting lists. In this report, we propose new approaches that use semantic relationships among queries or documents to increase the effectiveness and efficiency of large-scale WSEs. Semantic relationships among queries reveal that different people might use different words for the same or very similar information needs. Traditional caching strategies work only if the submitted query is exactly the same as a cached query. We propose to exploit the cached results in the case of a partial hit, where the cached query is a subset or superset of the new query but not identical to it. Semantic relationships among documents lead to clusters (or categories) of documents. Such a structuring of documents enables further improvements to caching and index pruning strategies.
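
To make the exact-hit versus partial-hit distinction concrete, the following is a minimal sketch and not the method proposed in the report. It assumes that a query can be treated as an unordered set of lowercase terms; the names `normalize`, `ResultCache`, `put`, and `lookup`, as well as the hit labels, are hypothetical and chosen only for illustration. The sketch only classifies a lookup; how the cached results of a partial hit are then used is left open.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple


def normalize(query: str) -> FrozenSet[str]:
    """Assumption: a query is an unordered set of lowercase terms."""
    return frozenset(query.lower().split())


class ResultCache:
    """Toy query-result cache that distinguishes exact and partial hits."""

    def __init__(self) -> None:
        self._store: Dict[FrozenSet[str], List[str]] = {}

    def put(self, query: str, results: List[str]) -> None:
        """Store a copy of the result list under the query's term set."""
        self._store[normalize(query)] = list(results)

    def lookup(self, query: str) -> Tuple[str, Optional[List[str]]]:
        """Return (hit kind, cached results or None)."""
        terms = normalize(query)

        # Exact hit: the only case a traditional result cache can serve.
        if terms in self._store:
            return "exact", self._store[terms]

        # Partial hit: a cached query whose terms are a proper subset or
        # proper superset of the new query's terms. The linear scan is for
        # illustration only; the first matching entry is returned.
        for cached_terms, results in self._store.items():
            if cached_terms < terms:
                return "cached_is_subset", results
            if cached_terms > terms:
                return "cached_is_superset", results

        return "miss", None


if __name__ == "__main__":
    cache = ResultCache()
    cache.put("web search", ["doc1", "doc2"])
    for q in ["web search", "web search caching", "web", "index pruning"]:
        kind, _ = cache.lookup(q)
        print(q, "->", kind)
```

Under conjunctive (AND) query semantics, a cached subset query would return a superset of the documents matching the new query, and a cached superset query a subset of them. A real cache usually stores only the top-ranked results, so in either case the cached list is at best a starting point rather than a complete answer.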


DATE: Monday, 22 October 2007 @ 13:40