Project topics for CS490 (Fall 2008)

 

1) Site-based Dynamic Query Pruning for Search Engines

 

Description: Search engines use early stopping heuristics while query processing for fast and effective answering of the user queries. We recently proposed site-based pruning, which means that a search engine considers only Web pages from those “promising” sites for the given query, and discard all other Web sites. The goal of this project is implementing some of the dynamic pruning in the literature and comparing their performance against the site-based pruning, with possible optimizations on the latter. In particular, the project includes the following stages:

 

  • Literature study: The student is expected to read some relevant dynamic pruning techniques in the literature, such as those based on max_score pruning, frequency-sorted inverted index files and impact-sorted index files. Also, (s)he should learn about our site-based pruning approach, by reading the relevant material [1] and consulting with the research group members.

 

  • Implementation and research tasks: At the first place, some of the techniques mentioned in the first stage would be implemented on top of our IR system, or maybe on top of an open-source system, to serve as the baseline. The implementation of the site-based pruning system is already available, but the student is expected to make some further additions: such as considering URL hierarchy and the number of pages at a particular Web site to improve performance.

 

  • Experimentation: The dataset from TREC including 25 million Web pages would be used to compare the performance of the proposed approach and the techniques in the literature.

 

 

2) Hybrid techniques for the static pruning of inverted index files

 

Description: Search engines essentially use an inverted index structure to obtain the query results in a timely manner. Still, for very large collections like the Web; it is impossible to use the entire inverted index during query processing, since some query terms may occur in millions or even billions of documents. To this end, there exist several static pruning techniques to reduce the inverted index size. However, it is highly possible that using a single technique would not be sufficient alone, and a large scale search engine should combine several techniques simultaneously for providing high performance without any reduction in the result quality. The goal of this project is investigating how the static pruning methods can be combined and evaluating the performance and result quality of such hybrid approaches. In particular, the project includes the following stages:

 

  • Literature study: The student is expected to read some static pruning techniques in the literature, in particular, the term-centric [2], document-centric [3] and access-based [4, 5] methods.

 

  • Implementation and research tasks: At the first place, all three techniques mentioned in the first stage would be implemented on top of our IR system, or maybe on top of an open-source system, to serve as the baseline. Next, we will investigate how these techniques can be combined: some possible options are applying techniques in some linear order, or learning a decision function to decide what to prune.

 

Experimentation: The dataset from TREC including 25 million Web pages would be used to evaluate the performance of the hybrid approaches.

 

3) Cluster-based patent retrieval

 

Description: Patent data takes very large volumes and an expert has to retrieve and control many existing patents to issue a new patent to an application. This project aims to develop cluster based retrieval (CBR) techniques (i.e., grouping patents on similar topics together) for efficient and effective retrieval of patents. The major stages of the project are:

 

  • Literature study: The student is expected to read the papers on CBR published by our research group [13], as well as some of the recent articles on patent retrieval [6]. 

 

  • Implementation and research tasks: The source code of our system is already available for CBR, so the student’s main task is first preprocessing the dataset (which may be in another language, like Japan) and making it usable in our system. Next, several methods for clustering the patents (i.e., may require implementing a clustering algorithm), representing the clusters (e.g., KL-divergence based approaches, etc.) and retrieval can be investigated.

 

  • Experimentation: The dataset is expected to be obtained from a Japanese organization. Alternatively, a repository can be created by crawling some patent sites on the Web. The system would be evaluated in terms of efficiency (given that the queries, i.e., input patterns, are much longer than a typical Web query) and retrieval quality.

 

4) Access-based inverted index pruning techniques

 

Description: This project is aimed to investigate many static and dynamic pruning techniques for inverted index files based on the access frequency of the documents in the collection. The project involves the following stages:

 

  • Literature study: The student is expected to read the papers on access-based pruning techniques [4, 5].

 

  • Implementation and research tasks: At the first place, the access-based retrieval techniques would be implemented as described in the literature. Next, a static pruning approach using access-counts would be implemented. We also want to observe the smoothing effect of the clusters for access-based retrieval, i.e., how exactly the system performance is affected if clusters are used directly or indirectly in this process. The clusters can be created automatically (i.e., by using an algorithm like C3M), manually (i.e., like DMOZ) or by exploiting the site-name of the Web pages. Finally, a more fine grain approach using access-counts per posting list entry will be explored for dynamic pruning.

 

Experimentation: The dataset from TREC including 25 million Web pages and a query log of AOL with 20 million queries would be used to evaluate the performance of the proposed approach.

 

5) Rule based focused crawling techniques

 

Description: A focused crawler is a program that traverses the Web pages (by following the hyperlinks) and downloads only those pages that are about a given “topic”, and that’s why it is called “focused”. In this project, first the relationships among different topics will be investigated to guide a focused crawler. Secondly, an evaluation framework for a number of existing crawlers will be created and the performance of the proposed technique will be measured. The steps of the project are as follows.

 

  • Literature study: The student is expected to learn the notion of focused crawling [7] and read relevant papers of our research group [8, 9].

 

  • Implementation and research tasks: At the first place, an open source crawler should be adapted to work as a focused crawler. On top of this crawler, a rule-based approach would be implemented. The rule-based approach also requires extracting rules among different topics that can be learned from a hierarchy, say, DMOZ (www.odp.com). Next, several open-source focused crawlers will be deployed (which may require good command of Linux) and a comparison framework will be developed to evaluate the performance of different focused crawling approaches, including ours.

 

  • Experimentation: The experiments would be realized using the comparison framework developed in the previous stage, by either using an online crawl approach or by simulating a crawl over the already downloaded data. The performance of the rule-based focused crawler will be evaluated in terms of the harvest rate, i.e., ratio of the number of on-topic pages to the number of the downloaded pages.

 

6) Efficient storage of query results in a search engine’a static cache

 

Description: Search engines store query results in a static cache to assure fast response time for popular queries. This project will investigate cluster based methods for utilizing the storage space. The steps of the project are as follows:

 

  • Literature study: The student should read and learn the relevant material [10].

 

  • Implementation and research tasks: At the first place, a clustering technique for queries would be developed that specifically exploits the query result similarity. Next, different and more compact storage schemes would be investigated and implemented.

 

Experimentation: The subset of a dataset from AOL (including 20 million queries) would be used to evaluate the storage efficiency of the system.

 

7) Efficient Query Processing for Web Directories

 

Description: Many Web directories (like DMOZ, Yahoo, etc.) process queries that are asked under a certain category by the user (e.g., Top.Science, Top.Economics, etc.). This project involves enhancing the recently proposed techniques for efficient processing of such queries in a Web directory. The stages of the project involves following:

 

  • Literature study: The student is expected to learn the notion of category restricted queries [11] and read relevant papers of our research group [12].

 

  • Implementation and research tasks: First of all, the data should be downloaded from DMOZ Web site (www.odp.com) and preprocessed. Next, it would be indexed by using the CBR system already implemented in our research group. Additionally, some of the techniques in the literature should be adapted and implemented for this dataset, to serve as a baseline. Alternative methods of representing clusters will also be investigated.

 

  • Experimentation: The experiments would be realized using the DMOZ crawl data which is expected to be around 2 million pages. The efficiency and effectiveness of the system will be compared to the approaches in the literature.

 

8) Cross language social tagging

 

Description:  Cross-language IR is a well-known research problem which essentially aims to retrieve documents in some language which is different than the language of the submitted query. In analogy, we assume that the tags at the social bookmarking sites like del.ic.ious etc. may be given at one language but searched at another language. This project aims to investigate retrieval techniques for such cases. Project stages involve the following:

 

  • Literature study: The student is expected to learn the notion of social bookmarking.

 

  • Implementation and research tasks: At the first place, a dataset that involves URLs and associated tags must be constructed. This may require a crawling process. Next, approaches to search the tags in the languages other than the query language should be investigated. Such approaches may include using direct translation of query words, using WordNet, or exploiting document contexts with the tags in the same language with the query. Finally, a Web-based evaluation environment will be developed and a user-study will be conducted.

 

  • Experimentation: The dataset will be obtained by crawling, as described above. The system performance will be measured in terms of the quality of the results and user satisfaction as obtained from the user-study.

 

 

References

[1] S. Altingovde, E. Demir, F. Can, Ö. Ulusoy, Site-based Dynamic Pruning for Query Processing in Search Engines, 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'08), Singapore, 2008. ( pdf copy)

[2] Aya Soffer, David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoëlle S. Maarek: Static Index Pruning for Information Retrieval Systems. SIGIR 2001: 43-50.

[3] Stefan Büttcher, Charles L. A. Clarke: A document-centric approach to static index pruning in text retrieval systems. CIKM 2006: 182-189.

[4] Steven Garcia, Andrew Turpin: Efficient Query Evaluation Through Access-Reordering. AIRS 2006: 106-118.

[5] Steven Garcia, Hugh E. Williams, Adam Cannane: Access-Ordered Indexes. ACSC 2004: 7-14.

[6] In-Su Kang, Seung-Hoon Na, Jungi Kim, Jong-Hyeok Lee: Cluster-based patent retrieval. Inf. Process. Manage. 43(5): 1173-1182 (2007).

[7] Soumen Chakrabarti, Martin van den Berg, Byron Dom: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Computer Networks 31(11-16): 1623-1640 (1999).

[8] I. S. Altingovde, Ö. Ulusoy, Exploiting Interclass Rules for Focused Crawling, IEEE Intelligent Systems, vol.19, no.6, 2004. ( pdf copy)

[9] I. S. Altingövde, R. Ozcan, S. Cetintas, H.Yilmaz, Ö. Ulusoy, An Automatic Approach to Construct Domain-Specific Web Portals, ACM 16th Conference on Information and Knowledge Management (CIKM’07), Lisbon, Portugal, November 2007. ( pdf copy)

[10] R. Ozcan, I. S. Altingovde, Ö. Ulusoy, Space Efficient Caching of Query Results in Search Engines, International Symposium on Computer and Information Sciences (ISCIS'08), to appear, 2008.

[11] Cacheda, F., Baeza-Yates, R. An optimistic model for searching Web directories. In Proc. of ECIR’04, 364-377, 2004.

[12] I. S. Altingovde, F. Can, Ö. Ulusoy, Efficient Processing of Category-Restricted Queries for Web Directories, 30th European Conference on Information Retrieval (ECIR'08),  Lecture Notes in Computer Science (Springer Verlag), vol.4956, 2008.( pdf copy)

[13] S. Altingovde, E. Demir, F. Can, Ö. Ulusoy, Incremental Cluster-Based Retrieval Using Compressed Cluster-Skipping Inverted Files, ACM Transactions on Information Systems, vol.26, no.3, 2008. ( pdf copy