Bilkent University
Department of Computer Engineering
CS 590/690 SEMINAR

 

Copy number estimation using Counting Bloom Filters in de novo assembled genomes

 

Klea Zambaku
Master's Student
(Supervisor: Assoc. Prof. Can Alkan)
Computer Engineering Department
Bilkent University

Abstract: Genomes of complex organisms contain large amounts of repetitive sequence. These repeats provide elasticity to genomes, which in turn guides evolutionary processes; however, some repeats are also associated with diseases, either directly or indirectly by facilitating genome rearrangements. Repeats also contribute to misassemblies because of the ambiguities they create in paths through genome assembly graphs. To detect and resolve such ambiguities, it is useful to estimate the copy numbers of repeats efficiently. Here, we propose to estimate copy numbers using Bloom Filters and Counting Bloom Filters in both de novo genome assemblies and whole-genome sequences, with the aim of improving both run time and memory footprint. Our method uses spaced seeds to generate k-mers from genome assemblies, and these k-mers populate a Bloom Filter. We then scan long reads for k-mers and add each one that is found in this first Bloom Filter to a Counting Bloom Filter. Next, we translate the resulting counts into segments with estimated copy numbers. In our experiments, we used long and short reads from several organisms and compared our results to existing short-read copy number prediction algorithms. We also validated our results against MegaBlast and RepeatMasker, which we treat as the ground truth. We expect our method to be useful as part of a pipeline that resolves misassemblies left by de novo assembly algorithms, and it estimates copy numbers efficiently from long-read sequencing datasets.
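
The two-filter idea in the abstract can be illustrated with a minimal Python sketch. It is not the implementation presented in the seminar: the seed mask, filter size, number of hash functions, the BLAKE2b-based hashing, and every function and class name below are assumptions chosen for illustration. The sketch ignores reverse complements and read depth, and it stops at per-k-mer counts; the step that turns counts into segments with estimated copy numbers is not shown.

```python
import hashlib


def spaced_kmers(seq, mask):
    """Yield spaced k-mers: keep bases where mask has '1', skip where it has '0'."""
    keep = [i for i, c in enumerate(mask) if c == "1"]
    for start in range(len(seq) - len(mask) + 1):
        yield "".join(seq[start + i] for i in keep)


def _positions(item, size, num_hashes):
    """Derive num_hashes table positions for one item by salting BLAKE2b."""
    for salt in range(num_hashes):
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=bytes([salt])
        ).digest()
        yield int.from_bytes(digest, "big") % size


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = bytearray(size)  # one byte per slot, for readability

    def add(self, item):
        for p in _positions(item, self.size, self.num_hashes):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in _positions(item, self.size, self.num_hashes))


class CountingBloomFilter:
    """Approximate counts; collisions can overestimate but never underestimate."""

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.counts = [0] * size

    def add(self, item):
        for p in _positions(item, self.size, self.num_hashes):
            self.counts[p] += 1

    def count(self, item):
        return min(self.counts[p] for p in _positions(item, self.size, self.num_hashes))


def estimate_kmer_counts(assembly, reads, mask="1101011"):
    """Count, across reads, only those spaced k-mers that occur in the assembly."""
    seen = BloomFilter()
    for kmer in spaced_kmers(assembly, mask):
        seen.add(kmer)  # step 1: record assembly k-mers

    counts = CountingBloomFilter()
    for read in reads:
        for kmer in spaced_kmers(read, mask):
            if kmer in seen:  # step 2: count only k-mers present in the assembly
                counts.add(kmer)
    return counts
```

Calling `estimate_kmer_counts(assembly, reads).count(kmer)` returns an upper bound on how often that spaced k-mer was seen in the reads. Both filters use fixed-size tables, so memory does not grow with the number of distinct k-mers, at the cost of occasional false positives and overcounts.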

 

DATE: April 01, Monday @ 14:30
PLACE: EA 502