Bilkent University
Department of Computer Engineering


Probabilistic-Logic Programming for Biological Sequence Analysis


Prof. Dr. Henning Christiansen
Professor of Computer Science
Roskilde University, Denmark

In a research project running from 2007 to 2012, we studied a novel kind of probabilistic-logic model for the analysis of biological sequence data such as DNA. The starting point was PRISM, a probabilistic version of the logic programming language Prolog, developed by T. Sato and his group at Tokyo Institute of Technology. It is equipped with general machine learning and prediction facilities, for example to find the "most probable proof", which in the DNA case may be interpreted as the most relevant annotation of possible genes.

Standard models such as Hidden Markov Models and Stochastic Context-Free Grammars can be expressed in strikingly few lines of code, and all sorts of extensions thereof, e.g., into context-sensitive models, are also straightforward to express. Initially, there were serious problems in scaling up to sequences of realistic size, but these were more or less solved during the project.
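
To make the idea of a "most probable" annotation concrete, here is a minimal sketch of Viterbi decoding for a two-state Hidden Markov Model over DNA. It is written in plain Python rather than PRISM, and it is not the project's code: the state names ("coding", "noncoding"), all probabilities, and the function name `viterbi` are made-up assumptions chosen only for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy parameters: "coding" regions are GC-rich, "noncoding" AT-rich.
STATES = ["coding", "noncoding"]
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

annotation, log_prob = viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)
print(annotation)
```

The returned state path plays the role of the annotation: each position of the sequence is labelled with the state that the single most probable explanation assigns to it.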

Collections of models for different signals, as well as external tools such as BLAST, can be integrated into larger models called Bayesian Annotation Networks, which in turn are supported by a practical scripting language called BanPipe. The most remarkable feature of the approach is the ease with which new models can be put together and tested; execution speed is adequate for practical experiments, although considerably slower than, say, specialized HMM-based tools handwritten in C.

The same methods also seem to be applicable to a wide range of other data analysis tasks.


DATE: 29 May, 2013, Wednesday @ 13:40