Efficiency and Effectiveness of Query Processing in Cluster-Based Retrieval

FAZLI CAN

İSMAİL SENGÖR ALTINGÖVDE

ENGİN DEMİR

(canf, ismaila, endemir)@cs.bilkent.edu.tr

Computer Engineering Department, Bilkent University

Bilkent, Ankara, 06533, Turkey; August 8, 2002

Abstract

Our research shows that for large databases, without considerable additional storage overhead, cluster-based retrieval (CBR) can compete with the time efficiency and effectiveness of inverted index-based full search (FS). The proposed CBR method employs a storage structure that blends cluster membership information into the inverted file posting lists. This approach significantly reduces the cost of the similarity calculations needed for document ranking during query processing, and thereby improves efficiency. For example, in terms of in-memory computations, our new approach can reduce query processing time to 39% of that of FS. The experiments confirm that the approach is scalable and that system performance improves with increasing database size. In the experiments, we use the Cover Coefficient-based Clustering Methodology (C3M) and the Financial Times database of TREC-4, which contains 210,158 documents (564 MB) defined by 229,748 terms, with a total of 29,545,234 inverted index elements. This study provides CBR efficiency and effectiveness experiments using the largest corpus in an environment that employs no user interaction or user behavior assumption for clustering.
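As a rough illustration of the storage idea described in the abstract, the following Python sketch groups each term's postings by cluster, so that query processing can skip the postings of clusters that were not selected. The data layout, the function names (build_index, cbr_search, full_search), the dot-product scoring, and the assumption that the best clusters have already been chosen are all illustrative assumptions, not the implementation used in the paper.

```python
from collections import defaultdict


def build_index(doc_vectors, cluster_of):
    """Build term -> {cluster_id: [(doc_id, weight), ...]}.

    Hypothetical layout: cluster membership is stored inside the
    posting lists by grouping postings per cluster.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, vector in doc_vectors.items():
        cid = cluster_of[doc_id]
        for term, weight in vector.items():
            index[term][cid].append((doc_id, weight))
    return index


def cbr_search(index, query, best_clusters, top_k=10):
    """Rank only documents that belong to the selected clusters."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        postings_by_cluster = index.get(term, {})
        for cid in best_clusters:
            # Postings of unselected clusters are never touched.
            for doc_id, d_weight in postings_by_cluster.get(cid, ()):
                scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


def full_search(index, query, top_k=10):
    """Full search baseline: every cluster is treated as selected."""
    all_clusters = {cid for by_cluster in index.values() for cid in by_cluster}
    return cbr_search(index, query, all_clusters, top_k)


# Toy data (made up for this sketch).
docs = {
    1: {"bank": 2.0, "loan": 1.0},
    2: {"bank": 1.0, "river": 3.0},
    3: {"loan": 2.0, "rate": 1.0},
    4: {"river": 1.0, "fish": 2.0},
}
clusters = {1: 0, 3: 0, 2: 1, 4: 1}
toy_index = build_index(docs, clusters)
toy_query = {"bank": 1.0, "loan": 1.0}
cbr_result = cbr_search(toy_index, toy_query, best_clusters={0})
fs_result = full_search(toy_index, toy_query)
```

In this toy setting, restricting the search to cluster 0 scores two documents instead of three, which is the kind of saving in similarity calculations that the abstract refers to.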

Keywords: Clustering, cluster-based retrieval, information retrieval, performance, query processing.

The full paper is available here.

Source code is available for the following tasks:

preprocessing (to create the document vectors, the inverted index, and the skipping inverted index)

C3M clustering

performing FS, ICIIS and ICsIIS
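To give a sense of what the preprocessing task involves, here is a minimal Python sketch that builds document vectors, an inverted index, and a simple set of skip entries. It is not the distributed source code: the whitespace tokenizer, the raw term-frequency weights, the function names, and the fixed skip interval are all assumptions made for illustration, and the skip structure shown is a generic textbook one that may differ from the one used by the authors.

```python
from collections import Counter, defaultdict


def make_document_vectors(texts):
    """Map doc_id -> term-frequency vector (naive whitespace tokenizer)."""
    return {doc_id: Counter(text.lower().split()) for doc_id, text in texts.items()}


def make_inverted_index(doc_vectors):
    """Map term -> postings [(doc_id, tf), ...] sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(doc_vectors):
        for term, tf in doc_vectors[doc_id].items():
            index[term].append((doc_id, tf))
    return dict(index)


def make_skips(postings, interval=2):
    """Return (position, doc_id) for every `interval`-th posting.

    A reader of the list can jump between these entries instead of
    scanning every posting (hypothetical fixed-interval scheme).
    """
    return [(pos, postings[pos][0]) for pos in range(0, len(postings), interval)]


# Toy collection (made up for this sketch).
texts = {1: "bank loan bank", 2: "river bank", 3: "loan rate"}
vectors = make_document_vectors(texts)
inverted = make_inverted_index(vectors)
bank_skips = make_skips(inverted["bank"], interval=1)
```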

Note that we cannot provide the data sets, as they are copyrighted TREC material.

Please e-mail I. Sengör Altingövde or Engin Demir with any questions about the source code.

Last updated on 29.08.2002