Project topics for CS490 (Fall 2008)
1) Site-based
Dynamic Query Pruning for Search
Engines
Description: Search engines use early stopping heuristics while
query processing for fast and effective answering of the user queries. We
recently proposed site-based pruning, which means that a search engine
considers only Web pages from those “promising” sites for the given query, and
discard all other Web sites. The goal of this project is implementing some of
the dynamic pruning in the literature and comparing their performance against
the site-based pruning, with possible optimizations on the latter. In
particular, the project includes the following stages:
- Literature study: The student is expected to read some relevant
dynamic pruning techniques in the literature, such as those based on max_score pruning, frequency-sorted inverted index
files and impact-sorted index files. Also, (s)he
should learn about our site-based pruning approach, by reading the
relevant material [1] and consulting with the research group members.
- Implementation and research tasks: At the first place, some of the techniques
mentioned in the first stage would be implemented on top of our IR system,
or maybe on top of an open-source system, to serve as the baseline. The
implementation of the site-based pruning system is already available, but
the student is expected to make some further additions: such as
considering URL hierarchy and the number of pages at a particular Web site
to improve performance.
- Experimentation: The dataset from TREC including 25 million Web
pages would be used to compare the performance of the proposed approach
and the techniques in the literature.
2) Hybrid
techniques for the static pruning of inverted index files
Description: Search engines essentially use an inverted index
structure to obtain the query results in a timely manner. Still, for very large
collections like the Web; it is impossible to use the entire inverted index
during query processing, since some query terms may occur in millions or even
billions of documents. To this end, there exist several static pruning
techniques to reduce the inverted index size. However, it is highly possible
that using a single technique would not be sufficient alone, and a large scale
search engine should combine several techniques simultaneously for providing
high performance without any reduction in the result quality. The goal of this
project is investigating how the static pruning methods can be combined and
evaluating the performance and result quality of such hybrid approaches. In
particular, the project includes the following stages:
- Literature study: The student is expected to read some static
pruning techniques in the literature, in particular, the term-centric [2],
document-centric [3] and access-based [4, 5] methods.
- Implementation and research tasks: At the first place, all three techniques
mentioned in the first stage would be implemented on top of our IR system,
or maybe on top of an open-source system, to serve as the baseline. Next,
we will investigate how these techniques can be combined: some possible
options are applying techniques in some linear order, or learning a
decision function to decide what to prune.
Experimentation: The dataset from TREC including 25 million Web pages would be used to evaluate
the performance of the hybrid approaches.
3) Cluster-based patent retrieval
Description: Patent data takes very large volumes and an expert has
to retrieve and control many existing patents to issue a new patent to an
application. This project aims to develop cluster based retrieval (CBR) techniques
(i.e., grouping patents on similar topics together) for efficient and effective
retrieval of patents. The major stages of the project are:
- Literature study: The student is expected to read the papers on
CBR published by our research group [13], as well as some of the recent
articles on patent retrieval [6].
- Implementation and research tasks: The source code of our system is already
available for CBR, so the student’s main task is first preprocessing the
dataset (which may be in another language, like Japan) and making it usable in
our system. Next, several methods for clustering the patents (i.e., may
require implementing a clustering algorithm), representing
the clusters (e.g., KL-divergence based approaches, etc.) and retrieval
can be investigated.
- Experimentation: The dataset is expected to be obtained from a
Japanese organization. Alternatively, a repository can be created by
crawling some patent sites on the Web. The system would be evaluated in terms
of efficiency (given that the queries, i.e., input patterns, are much
longer than a typical Web query) and retrieval quality.
4) Access-based
inverted index pruning techniques
Description: This project is aimed to investigate many static and
dynamic pruning techniques for inverted index files based on the access
frequency of the documents in the collection. The project involves the
following stages:
- Literature study: The student is expected to read the papers on
access-based pruning techniques [4, 5].
- Implementation and research tasks: At the first place, the access-based retrieval
techniques would be implemented as described in the literature. Next, a
static pruning approach using access-counts would be implemented. We also
want to observe the smoothing effect of the clusters for access-based
retrieval, i.e., how exactly the system performance is affected if
clusters are used directly or indirectly in this process. The clusters can
be created automatically (i.e., by using an algorithm like C3M), manually
(i.e., like DMOZ) or by exploiting the site-name of the Web pages.
Finally, a more fine grain approach using access-counts per posting list
entry will be explored for dynamic pruning.
Experimentation: The dataset from TREC including 25 million Web pages and a query log
of AOL with 20 million queries would be used to evaluate the performance of the
proposed approach.
5) Rule
based focused crawling techniques
Description: A focused crawler is a program that traverses the Web
pages (by following the hyperlinks) and downloads only those pages that are
about a given “topic”, and that’s why it is called “focused”. In this project,
first the relationships among different topics will be investigated to guide a
focused crawler. Secondly, an evaluation framework for a number of existing
crawlers will be created and the performance of the proposed technique will be
measured. The steps of the project are as follows.
- Literature study: The student is expected to learn the notion of
focused crawling [7] and read relevant papers of our research group [8, 9].
- Implementation and research tasks: At the first place, an open source crawler
should be adapted to work as a focused crawler. On top of this crawler, a
rule-based approach would be implemented. The rule-based approach also
requires extracting rules among different topics that can be learned from
a hierarchy, say, DMOZ (www.odp.com).
Next, several open-source focused crawlers will be deployed (which may
require good command of Linux) and a comparison framework will be
developed to evaluate the performance of different focused crawling
approaches, including ours.
- Experimentation: The experiments would be realized using the
comparison framework developed in the previous stage, by either using an
online crawl approach or by simulating a crawl over the already downloaded
data. The performance of the rule-based focused crawler will be evaluated
in terms of the harvest rate, i.e., ratio of the number of on-topic pages to
the number of the downloaded pages.
6) Efficient
storage of query results in a search engine’a static cache
Description: Search engines store query results in a static cache
to assure fast response time for popular queries. This project will investigate
cluster based methods for utilizing the storage space. The steps of the project
are as follows:
- Literature study: The student should read and learn the relevant
material [10].
- Implementation and research tasks: At the first place, a clustering technique for
queries would be developed that specifically exploits the query result
similarity. Next, different and more compact storage schemes would be
investigated and implemented.
Experimentation: The subset of a dataset from AOL (including 20 million queries) would
be used to evaluate the storage efficiency of the system.
7) Efficient
Query Processing for Web Directories
Description: Many Web directories (like DMOZ, Yahoo, etc.) process
queries that are asked under a certain category by the user (e.g., Top.Science, Top.Economics,
etc.). This project involves enhancing the recently proposed techniques for efficient
processing of such queries in a Web directory. The stages of the project
involves following:
- Literature study: The student is expected to learn the notion of category
restricted queries [11] and read relevant papers of our research group [12].
- Implementation and research tasks: First of all, the data should be downloaded from
DMOZ Web site (www.odp.com) and
preprocessed. Next, it would be indexed by using the CBR system already
implemented in our research group. Additionally, some of the techniques in
the literature should be adapted and implemented for this dataset, to
serve as a baseline. Alternative methods of representing clusters will
also be investigated.
- Experimentation: The experiments would be realized using the DMOZ
crawl data which is expected to be around 2 million pages. The efficiency
and effectiveness of the system will be compared to the approaches in the
literature.
8) Cross language social tagging
Description: Cross-language
IR is a well-known research problem which essentially aims to retrieve
documents in some language which is different than the language of the submitted
query. In analogy, we assume that the tags at the social bookmarking sites like
del.ic.ious
etc. may be given at one language but searched at another language. This
project aims to investigate retrieval techniques for such cases. Project stages
involve the following:
- Literature study: The student is expected to learn the notion of
social bookmarking.
- Implementation and research tasks: At the first place, a dataset that involves URLs
and associated tags must be constructed. This may require a crawling
process. Next, approaches to search the tags in the languages other than
the query language should be investigated. Such approaches may include
using direct translation of query words, using WordNet,
or exploiting document contexts with the tags in the same language with
the query. Finally, a Web-based evaluation environment will be developed
and a user-study will be conducted.
- Experimentation: The dataset will be obtained by crawling, as
described above. The system performance will be measured in terms of the
quality of the results and user satisfaction as obtained from the
user-study.
References
[1] S. Altingovde, E. Demir, F. Can, Ö. Ulusoy, Site-based
Dynamic Pruning for Query Processing in Search Engines, 31st Annual
International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR'08), Singapore,
2008. ( pdf copy)
[2] Aya Soffer, David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoëlle S. Maarek: Static Index
Pruning for Information Retrieval Systems. SIGIR 2001: 43-50.
[3] Stefan Büttcher, Charles L. A. Clarke: A document-centric approach
to static index pruning in text retrieval systems. CIKM 2006: 182-189.
[4] Steven Garcia, Andrew
Turpin: Efficient Query Evaluation Through
Access-Reordering. AIRS 2006: 106-118.
[5] Steven Garcia, Hugh E.
Williams, Adam Cannane: Access-Ordered Indexes. ACSC
2004: 7-14.
[6] In-Su Kang, Seung-Hoon Na, Jungi Kim, Jong-Hyeok Lee: Cluster-based patent retrieval. Inf.
Process. Manage. 43(5): 1173-1182 (2007).
[7] Soumen Chakrabarti, Martin van den
Berg, Byron Dom: Focused Crawling: A New Approach to
Topic-Specific Web Resource Discovery. Computer Networks 31(11-16): 1623-1640
(1999).
[8] I. S. Altingovde, Ö. Ulusoy, Exploiting Interclass Rules for Focused
Crawling, IEEE Intelligent Systems, vol.19, no.6, 2004. ( pdf copy)
[9] I. S. Altingövde, R. Ozcan, S. Cetintas, H.Yilmaz, Ö. Ulusoy, An
Automatic Approach to Construct Domain-Specific Web Portals, ACM 16th
Conference on Information and Knowledge Management (CIKM’07), Lisbon,
Portugal, November 2007. ( pdf copy)
[10] R. Ozcan, I. S. Altingovde, Ö. Ulusoy, Space
Efficient Caching of Query Results in Search Engines, International
Symposium on Computer and Information Sciences (ISCIS'08), to appear, 2008.
[11] Cacheda, F., Baeza-Yates, R. An optimistic model for
searching Web directories. In Proc. of ECIR’04,
364-377, 2004.
[12] I. S. Altingovde,
F. Can, Ö. Ulusoy, Efficient Processing of
Category-Restricted Queries for Web Directories, 30th European
Conference on Information Retrieval (ECIR'08), Lecture
Notes in Computer Science (Springer Verlag),
vol.4956, 2008.( pdf
copy)
[13] S. Altingovde, E. Demir, F. Can, Ö. Ulusoy, Incremental
Cluster-Based Retrieval Using Compressed Cluster-Skipping Inverted Files, ACM
Transactions on Information Systems, vol.26, no.3, 2008.
( pdf copy)