Bilkent University
Department of Computer Engineering


GPGPU Accelerated Mapping of High Throughput Sequence Reads


Mustafa Korkmaz
MSc Student
Computer Engineering Department
Bilkent University

The high throughput sequencing (HTS) methods have already started to fundamentally revolutionize the area of genome research through low-cost and high-throughput genome sequencing. However, the sheer size of data imposes various computational challenges. For example, in the Illumina HiSeq2000, each run produces over 7-8 billion short reads and >600 Gb of basepairs of sequence data within less than 10 days. For most applications, analysis of HTS data starts with read mapping, i.e. finding the locations of these short sequence reads in a reference genome assembly.

The similarities between two sequences can be determined by computing their optimal local alignments using a dynamic programming method called the Smith-Waterman algorithm. The Smith-Waterman algorithm is widely used in hash-based DNA read mapping algorithms because of its high sensitivity. However, the quadratic time complexity of this algorithm makes it highly time-consuming and the main bottleneck in analysis. In addition to this drawback, the short length of reads (~100 bp) and the large size of mammalian genomes (3.1 Gbp for human) worsens the situation by requiring several hundreds to tens of thousands of Smith-Waterman calculations per read. The fastest approach proposed so far avoids Smith-Waterman and maps the data described above in 70 CPU days with low sensitivity. More sensitive mapping approaches are even slower. We propose that efficient parallel implementations of string comparison will dramatically improve the running time of this process. With this motivation, we propose to develop enhanced algorithms to exploit the parallel architecture of GPUs.

Keywords: Read Mapping,String Comparison, Dynamic Programming, GPGPU, CUDA


DATE: 14 May, 2012, Monday @ 15:30