Bilkent University
Department of Computer Engineering


SPADIS: An Algorithm for Selecting Predictive and Diverse SNPs in GWAS


Serhan Yılmaz
MS Student
Computer Engineering Department
Bilkent University

Phenotypic heritability of complex traits and diseases is seldom explained by individual genetic variants identified in genome-wide association studies (GWAS). Many methods have been developed to select a subset of variant loci that are associated with or predictive of the phenotype. Selecting single nucleotide polymorphisms (SNPs) that are close to each other on a biological network has proven successful in finding biologically interpretable and predictive SNPs. However, we argue that the closeness constraint favors selecting redundant features that affect similar biological processes and therefore does not necessarily yield better predictive performance. To this end, we propose a novel method called SPADIS that selects a set of loci such that diverse regions in the underlying SNP-SNP network are covered. SPADIS favors remotely located SNPs in order to account for the complementary additive effects of SNPs that are associated with the phenotype. This is achieved by maximizing a submodular set function with a greedy algorithm that guarantees a constant-factor (1 - 1/e) approximation. We compare SPADIS to the state-of-the-art method SConES on a dataset of Arabidopsis thaliana genotypes and continuous flowering time phenotypes. On average, SPADIS achieves better regression performance in 12 out of 17 phenotypes; it also identifies more candidate genes and runs faster. We also investigate, for the first time, the use of Hi-C data to construct the SNP-SNP network for the SNP selection problem, which yields slight improvements in regression performance.
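
The greedy strategy mentioned in the abstract can be illustrated with a minimal sketch. The objective below is NOT the SPADIS objective, which the abstract does not specify; it is a stand-in weighted coverage function (a standard monotone submodular function), chosen only because it also rewards picking nodes from different regions of a graph. The names covered_weight and greedy_select, the toy path graph, and the weights are all hypothetical.

```python
def covered_weight(selected, neighbors, weight):
    """Total weight of nodes covered by `selected` (each node covers itself
    and its direct neighbors). Stand-in objective, not the SPADIS one."""
    covered = set()
    for s in selected:
        covered.add(s)
        covered.update(neighbors[s])
    return sum(weight[v] for v in covered)


def greedy_select(neighbors, weight, k):
    """Pick up to k nodes, each time adding the node with the largest
    marginal gain. For a monotone submodular objective under a cardinality
    constraint, this greedy rule gives a (1 - 1/e) approximation."""
    selected = []
    current = 0.0
    candidates = set(neighbors)
    for _ in range(min(k, len(candidates))):
        best, best_gain = None, 0.0
        for v in sorted(candidates):  # sorted for deterministic tie-breaking
            gain = covered_weight(selected + [v], neighbors, weight) - current
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # no candidate improves the objective
            break
        selected.append(best)
        candidates.remove(best)
        current += best_gain
    return selected


# Hypothetical toy network: a path 0-1-2-3-4-5 with made-up node scores.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
weight = {0: 1, 1: 3, 2: 1, 3: 1, 4: 2, 5: 1}
picked = greedy_select(neighbors, weight, k=2)
```

In this toy example the second pick lands far from the first because nodes adjacent to an already selected node add little new coverage, which mirrors, in a simplified way, the preference for remotely located SNPs described above.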


DATE: Monday, 12 March 2018. CS590 & CS690 presentations begin at 15:40.