Bilkent University
Department of Computer Engineering
CS 590/690 SEMINAR
Conpresso: Compressing and querying genome collections
Ali Erdem Karaçay
Master's Student
(Supervisor: Assoc. Prof. Can Alkan)
Computer Engineering Department
Bilkent University
Abstract:
Motivation: The decreasing costs of sequencing technologies have led to an exponential increase in available genomic data, fueling large-scale initiatives such as the Human Pangenome Reference Consortium and AllTheBacteria. However, this accessibility has introduced a critical computational bottleneck: data storage. Although general-purpose compressors reduce file sizes dramatically, they do not exploit the unique structural characteristics of genomic sequences. Consequently, efficient storage and querying of massive genomic datasets have become a paramount research challenge. In this paper, we explore how locally consistent parsing (LCP) within a dictionary-based architecture captures these genomic redundancies. Our approach not only achieves highly efficient compression but also enables querying the compressed data without decompression.
Results: Our LCP-based compressor achieves up to 5× greater compression than general-purpose tools and up to 2× greater compression than specialized genomic tools in specific use cases. Using frequency-mapped encoding, our architecture maintains high performance and minimal memory overhead. Crucially, the tool operates entirely de novo, requiring no reference genome, and allows sub-linear time sequence searches directly on the compressed archive.
Date: March 30, Monday @ 16:30
Place: EA 502