S E M I N A R
Department of Electrical Engineering and Computer Science
Finding sequences similar to a given query sequence in a large collection of sequences is a fundamental problem in many database applications including, computational genomics, computational finance, image and text processing. The similarity between sequences is defined in terms of a distance function determined by the application domain. In this work, we consider sequence proximity search in computational genomics, where sequence similarity is usually an indication of an evolutionary relationship between DNA and protein sequences, and usually indicates functional similarity. The most popular distance measures are based on (a weighted) count of character edit or block edit operations to transform one string to another. The main goal is to develop efficient near neighbor search tools that work for both character edit and block edit distances. Our premise is that the Distance Based Indexing techniques, which are originally developed for metric distances can be modified for sequence distance measures provided that they are almost metrics. We first show that sequence distance functions of interest (compression distance and weighted character edit distance) are almost metric. We then show how to modify distance based index structures vantage point trees to accommodate almost metric distances. We test our theoretical results on synthetic data sets and protein sequences.
DATE: April 30, 2003, Wednesday @ 16:40