Bilkent University
Department of Computer Engineering


A Profile HMM-based hybrid error correction algorithm for long sequencing reads


Can Fırtına
MS Student
Computer Engineering Department
Bilkent University

Next-generation sequencing technologies generate DNA sequence reads faster and at a fraction of the cost of older technologies. Multiple sequencing platforms are available, each with different strengths and weaknesses. PacBio and Nanopore sequencers can output reads that reach up to 40 kb in length. Increased read length enables efficient downstream analyses such as structural variation and isoform detection. The caveat is that these sequencers produce reads with a significantly higher error rate (15-20%) than the short reads (<=150 bp) generated using Illumina. Thus, any analysis using long reads requires a preprocessing step for error correction. In this study, we propose an algorithm for correcting long reads using a profile Hidden Markov Model (pHMM). The method aligns high-confidence short reads onto long reads and then trains the pHMM to learn posterior transition and emission probabilities, from which it produces an error-corrected consensus long read. Unlike existing methods that use simple per-base majority voting among the aligned short reads, our scheme is probabilistic and more flexible. Our preliminary results show that the algorithm increases the mapping rate of long reads to the ground truth from 40% to 80%. We also show that our correction performance, in terms of both mapping rate and sequence accuracy, is better than that of LSC, a state-of-the-art error correction method.
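
The contrast between per-base majority voting and a probabilistic estimate can be illustrated with a toy sketch. This is not the algorithm presented in the talk: it models only per-position emission probabilities (no insertion or deletion states, no transition probabilities, no iterative training), and it assumes the short reads are already aligned into a pileup with one column per long-read base. The function names, the `pseudocount` parameter, and the default `long_read_error` of 0.15 (taken from the lower end of the error range quoted above) are all illustrative assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"


def majority_vote(long_read, pileup):
    """Baseline: take the most common short-read base in each column.

    Columns with no short-read coverage keep the original long-read base.
    """
    corrected = []
    for base, column in zip(long_read, pileup):
        if column:
            corrected.append(Counter(column).most_common(1)[0][0])
        else:
            corrected.append(base)
    return "".join(corrected)


def emission_posteriors(long_read, pileup, long_read_error=0.15, pseudocount=1.0):
    """Toy probabilistic alternative (assumed model, not the talk's pHMM).

    Each position gets a prior centred on the long-read base; the prior acts
    as `pseudocount` virtual observations and is combined with the observed
    short-read base counts (posterior mean under a Dirichlet prior).
    """
    posteriors = []
    for base, column in zip(long_read, pileup):
        prior = {
            b: (1.0 - long_read_error) if b == base else long_read_error / 3.0
            for b in ALPHABET
        }
        counts = Counter(column)
        weights = {b: pseudocount * prior[b] + counts[b] for b in ALPHABET}
        total = sum(weights.values())
        posteriors.append({b: w / total for b, w in weights.items()})
    return posteriors


def consensus(posteriors):
    """Pick the most probable base at every position."""
    return "".join(max(p, key=p.get) for p in posteriors)


if __name__ == "__main__":
    long_read = "ACGTA"
    # One list of aligned short-read bases per long-read position.
    pileup = [["A", "A"], ["C", "G", "G"], [], ["T"], ["C", "C", "A"]]
    print(majority_vote(long_read, pileup))
    print(consensus(emission_posteriors(long_read, pileup)))
    # A stronger prior trusts the long read more than thin short-read evidence.
    print(consensus(emission_posteriors(long_read, pileup, pseudocount=4.0)))
```

The flexibility shows up in the last call: unlike a plain vote, the probabilistic estimate can weigh how much evidence is needed before a long-read base is overturned, and it yields a distribution per position rather than a single hard decision.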


DATE: 24 April, 2017, Monday @ 16:05