Summary of research topics
- Analysis of data from second- and third-generation sequencing technologies. Broadly, we are interested in analyzing any data generated by DNA sequencing platforms.
- Genome variation discovery and genotyping: The Human Genome Project took 15 years and cost 3-10 billion dollars, but the new high-throughput sequencing (HTS) platforms now make it possible to resequence the genome of a human individual in approximately 10 days for approximately $5,000. In the next year, this cost is expected to be reduced to $1,000. We can discover various forms of genomic variation, from single nucleotide polymorphisms to structural variants, by analyzing the observed read properties. However, this is a computationally difficult problem: the HTS platforms produce billions of short reads (~100-150 characters long), while the human genome is around 3 billion characters long, and the problem is further complicated by the repeats present in the human genome. We will develop novel algorithms to comprehensively and quickly discover all forms of genomic variants, including point mutations, indel polymorphisms and structural variation, while resolving inconsistencies among different variants to accurately identify normal and disease-causing variation.
- Read mapping for Illumina, Complete Genomics, Roche/454, AB SOLiD, Ion Torrent and Pacific Biosciences platforms: High-throughput sequencing technologies promise an era of preventive and personalized medicine through low-cost, high-throughput genome sequencing. The success of all medical and genetic applications of next-generation sequencing critically depends on computational technologies that can process and analyze the enormous amounts of sequence data quickly and energy-efficiently, without the need to build large infrastructures. The goal of this proposal is to develop such technologies by combining the benefits of enhanced software algorithms and specialized hardware accelerators such as GPGPUs, FPGAs and ASICs.
- De novo and hybrid (multi-platform) sequence assembly: Thanks to the substantially reduced cost of genome sequencing, there is now great interest in sequencing the genomes of thousands of species to better understand genomic diversity across different organisms, organismal biology and genome evolution. However, the limitations of the HTS technologies have hampered de novo sequencing studies that aim to construct the reference genomes of various species. This is mainly due to the repetitive structure of the genomes of most species, the short sequence reads generated by current platforms, and their increased error rates. Building on these observations and on empirical evidence that all current HTS platforms show different strengths and biases, we propose to devise novel genome assembly algorithms that use data from multiple sources, including, when available, data derived from laboratory experiments, to better assemble the genomes of new species.
- Genomic repeat discovery, classification and annotation: Due to difficulties in assembly, genomic repeats, especially short tandem repeats (alpha satellite and other satellite sequences), are either ignored or poorly assembled in published genome assemblies. This project aims to discover repeated sequence motifs from the raw high-throughput sequence data by developing algorithms based on sequence analysis and graph theory.
- Visualization of genomic data.
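
As a rough illustration of why repeats matter in the topics above, the sketch below counts k-mers (substrings of length k) across a set of reads and reports those seen unusually often. This is a generic textbook technique, not the algorithms developed in these projects; the function names, the toy reads and the `min_count` threshold are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        # A read shorter than k contributes no k-mers (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def overrepresented_kmers(reads, k, min_count):
    """Return k-mers seen at least min_count times (hypothetical threshold)."""
    counts = count_kmers(reads, k)
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads invented for this example; real reads are ~100-150 characters.
reads = ["ACGTACGTAC", "TTGACGTACG", "GGCATTGACC"]
repeats = overrepresented_kmers(reads, k=4, min_count=3)
print(repeats)
```

A k-mer that occurs many times cannot be placed at a unique position in the genome from its sequence alone, which is one reason repeats complicate read mapping, variant discovery and assembly. On real data the threshold would need to account for sequencing depth and errors.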
Click here for implementation-based projects for undergraduate students and summer interns.