Distance-Based Indexing and Similarity Search for Genomic Sequences

Prof.
Dr. Z. MeralÖzsoyoğlu

Department of Electrical
Engineering and Computer Science

CaseWestern ReserveUniversity

Finding sequences similar to a given query
sequence in a large collection of sequences is a fundamental problem in many database
applications including, computational genomics, computational finance, image
and text processing. The similarity between sequences is defined in terms of a
distance function determined by the application domain. In this work, we
consider sequence proximity search in computational genomics, where sequence
similarity is usually an indication of an evolutionary relationship between DNA
and protein sequences, and usually indicates functional similarity. The most
popular distance measures are based on (a weighted) count of character edit or
block edit operations to transform one string to another. The main goal is to
develop efficient near neighbor search tools that work for both character edit
and block edit distances. Our premise is that the Distance Based Indexing
techniques, which are originally developed for metric distances can be modified
for sequence distance measures provided that they are almost metrics. We first
show that sequence distance functions of interest
(compression distance and weighted character edit distance) are almost metric.
We then show how to modify distance based index structures vantage point trees
to accommodate almost metric distances. We test our theoretical results on
synthetic data sets and protein sequences.

DATE: April 30, 2003, Wednesday @ 16:40 PLACE: EA-409