Seminar in Computer Engineering

Bilkent University
Department of Computer Engineering
M.S.THESIS PRESENTATION

Pairwise Whole Genome Alignment Using Locally Consistent Parsing

Ecem İlgün
Master Student
(Supervisor: Assoc.Prof.Can Alkan)

Computer Engineering Department
Bilkent University

Abstract: Pairwise whole-genome alignment is a fundamental problem in computational biology, with applications in evolutionary analysis, variant discovery, and comparative genomics. While established tools like Minimap2, Progressive Mauve, and Mummer4 have seen significant advancements, the increasing length and complexity of reference-grade assemblies pose significant scaling challenges. On a scale of billion base pairs, efficient alignment typically relies on the seed-chain- extend heuristic, where representative sketches partition the genomes, which are then combined into an optimal chain to guide the alignment. Distributed and parallelized multiple genome alignment relies on efficiently partitioning the input genomes into smaller segments that can be processed independently. Existing partitioning methods often rely on maximal exact matches (MEMs), maximal unique matches (MUMs), or minimizers for sketching. However, for MEMs/MUMs, the alignment process is complicated by the O(m · log n) time required to find MEMs of size m in a string of size n. Similarly, minimizers have drawbacks in their distribution patterns and frequency due to their short length, which leads to suboptimal partitioning in terms of computational and communication overhead. Compared to minimizers, Locally Consistent Parsing (LCP) can offer a more thorough and condensed representation of the input data by identifying “cores,” or brief genomic sequences that are consistently present across genomes. Furthermore, LCP cores can be computed hierarchically in linear time, with higher-level cores acting as primary anchors that form a coarse- grained chain. In contrast, lower-level cores are used for finer- grained tuning and partitioning. Because LCP-based partitioning can result in more balanced computational loads and lower communication overhead between processes, it is better suited for distributed alignment. Building on these advantages, we developed a pairwise genome alignment framework that utilizes LCP as an alternative to MEMs/MUMs and minimizers, enabling the scalable and efficient alignment of large-scale genomic datasets.

DATE: January 20, Tuesday @ 15:15 Place: EA 409