A New Dynamic and Adaptive Scheme for Indexing in Metric Spaces


Umut Tosun
Computer Engineering Department
Bilkent University

Computer Science applications are often concerned with efficient storage and retrieval of data. Well defined structure of traditional databases help to access required query objects effectively using the Relational Database paradigm. However, in recent times, we are faced with the challenges of dealing with unstructured and complex data such as images, video, sound clips and text documents. Multimedia Information Retrieval, Data Mining, Pattern Recognition, Machine Learning, Computer Vision, Biomedical Databases are examples of some of the fields that require efficient management of complex data. Complex, unstructured type of data often cannot be broken down into well-defined components, and exact matching cannot be applied for defining queries. Instead, the notion of similarity search is used where a query or prototype object is provided by the user and the database retrieves objects that are similar.

One popular approach for similarity searching is to approximate the relationship between database objects by mapping them into a vector space. There are well-known indexing methods in literature that supports similarity queries in vector spaces, however, it has been shown that these methods are ineffective for high dimensional data. Another approach is to use Metric Spaces model for indexing. Metric spaces are defined by a distance function that has the triangular inequality property. Since there are no assumptions about the structure of the data itself, they constitute a higher level abstraction and thus have more applicability. They have also been shown to perform better in higher dimensions.

A lot of the previous work in metric spaces have concentrated on static methods that do not allow new insertions once the index structure has been initialized. M-Tree, Slim-Tree, DF-Tree, Omni are some of the popular dynamic structures. These methods can grow incrementally by splitting overflowed nodes and adding new levels to the tree very much like the B-tree variants. Unfortunately, they have been shown to perform very poorly compared to flat structures such as AESA, LAESA, Spaghettis and Kvp that use a fixed set of global pivots. The distances between the query object and the pivots are computed to eliminate some portion of the database from consideration. The number of pivots can be easily increased to provide more selectivity, thus better query performance. However, there is an optimum number of pivots for a given query radius, and using too many pivots increases the costs of queries and the initialization of the index. Recently, SSS was introduced as a LAESA variant that allows insertions of new database objects and dynamically promotes some of the new objects as pivots.

In this thesis, we argue that SSS has fundamental problems that results in poor query performance for clustered or otherwise skewed distributions. Real datasets have often been observed to show such characteristics. We show that SSS has been optimized to work for a symmetrical, balanced distribution and for a specific radius value. Our first main contribution is offering a new pivot promotion scheme that can perform robustly for clustered or skewed distributions. Our second contribution is proposing new methods that solves the problem of determining the right number of pivots for different query radius values. We show that our new indexing scheme performs significantly better than tree-based dynamic structures while having lower insertion costs. We also show that our structure adapts to changes in the database population in a superior way.


DATE: 17 August, 2007, Friday@ 14:00