International Workshop on
Probabilistic Modelling in Computational Biology
26 July 2006 Vienna

Probabilistic Methods for
Active Learning and Data Integration in Computational Biology

Affiliated with ISMB/ECCB 2007

Motivation and Outline of Sessions

Probabilistic methods are an obvious choice for integrative inference because the posterior information about a biological system that arises from one source of information can be regarded as the prior for incorporating information from another source. This approach can be applied iteratively and so lends itself naturally to data integration. Bayesian theory provides the axiomatic framework for a unified approach to important challenges of computational biology: besides allowing sequential belief updates, it supports optimal decision making under uncertainty and provides means for learning models from heterogeneous data sources.
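As a minimal illustration of the posterior-as-prior idea, the following sketch uses a conjugate Beta-Binomial model. The model choice, the function name update_beta, and all counts are assumptions made for illustration only. It shows that updating on one data source and then on a second gives the same posterior as updating on the pooled data.

```python
from fractions import Fraction


def update_beta(prior, successes, failures):
    """Conjugate Beta-Binomial update: return the posterior (alpha, beta)."""
    alpha, beta = prior
    return (alpha + successes, beta + failures)


# Uniform Beta(1, 1) prior on an unknown probability (illustrative choice).
prior = (1, 1)

# Hypothetical counts from two independent data sources.
after_a = update_beta(prior, successes=7, failures=3)    # source A
after_b = update_beta(after_a, successes=2, failures=8)  # source B, with A's posterior as prior

# Updating once on the pooled counts gives the same posterior.
pooled = update_beta(prior, successes=9, failures=11)
assert after_b == pooled

# Posterior mean of the Beta distribution: alpha / (alpha + beta).
posterior_mean = Fraction(after_b[0], sum(after_b))
```

The order in which the sources are incorporated does not matter here, which is what makes the iterative scheme well defined for independent data sources.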

In this context, the theory can be applied to experimental design, optimizing subsequent measurements based on an analysis of the data collected so far, and thus providing a principled approach to active learning. Such an optimization of experimental design can maximize the information gain in model inference by alternating experimental assays with model inference. This approach, for example, allows an improved inference of Bayesian networks capturing gene interactions. An intrinsic problem for the interpretation of such networks is that several network structures can encode the same joint distribution and thus form an equivalence class: as Bayes’ theorem allows the reversal of certain network edges, observational data alone prohibit statements on causality. As a powerful remedy, Bayesian experimental design can provide an optimal set of genes to test in subsequent measurements, for example, by gene knock-out, RNAi, or over-expression experiments.
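The argument above can be sketched for the smallest possible case, two binary genes A and B. The structures A -> B and B -> A encode the same joint distribution, so a purely observational measurement carries no information about the edge direction, whereas a knock-out of A does. All probabilities, names, and the restriction to two candidate experiments are assumptions made for illustration; this is a toy calculation, not a specific published method.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihoods):
    """Prior entropy minus expected posterior entropy over the hypotheses.

    likelihoods[h][o] is P(outcome o | hypothesis h) for one experiment.
    """
    gain = entropy(prior)
    for o in range(len(likelihoods[0])):
        joint = [p * lik[o] for p, lik in zip(prior, likelihoods)]
        p_outcome = sum(joint)
        if p_outcome > 0:
            posterior = [j / p_outcome for j in joint]
            gain -= p_outcome * entropy(posterior)
    return gain


# Hypotheses: H1 is A -> B, H2 is B -> A, equally likely a priori.
prior = [0.5, 0.5]

# Assumed joint: P(A=1) = 0.5, P(B=1 | A=1) = 0.9, P(B=1 | A=0) = 0.1.
# Outcomes (A, B) in the order (0,0), (0,1), (1,0), (1,1); identical under H1 and H2.
observational = [[0.45, 0.05, 0.05, 0.45],
                 [0.45, 0.05, 0.05, 0.45]]

# Knock-out of A (A forced to 0), then measure B; outcomes B=0, B=1.
# Under H1, B follows P(B | A=0); under H2, B keeps its marginal P(B=1) = 0.5.
knock_out_a = [[0.9, 0.1],
               [0.5, 0.5]]

experiments = {"observe only": observational, "knock out A": knock_out_a}
gains = {name: expected_information_gain(prior, lik)
         for name, lik in experiments.items()}
best = max(gains, key=gains.get)
```

In this toy setting the observational experiment has an expected gain of zero up to rounding, and the knock-out is selected as the more informative assay.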

Another example of an application of probabilistic data integration is the joint analysis of microarray data from different biological systems for common molecular mechanisms as witnessed by gene activity. In addition, the very same type of models can be exploited to increase the statistical power and hence reduce the cost of experiments by combining different microarray studies of the same biological question. In contrast to approaches where data from multiple laboratories or technologies have to be normalized before a joint analysis is possible, this meta-level method directly combines information extracted from the data.
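One simple way to picture combining information extracted from the data, rather than the raw data, is a precision-weighted combination of per-study estimates. The sketch below assumes each study reports a Gaussian estimate (mean and variance) of the same effect and that the studies are independent; the function name and the numbers are hypothetical, and the models discussed at the workshop may differ substantially.

```python
def combine_estimates(estimates):
    """Precision-weighted combination of independent (mean, variance) pairs.

    Equivalent to multiplying independent Gaussian likelihoods under a flat prior.
    """
    precision = sum(1.0 / var for _, var in estimates)
    mean = sum(m / var for m, var in estimates) / precision
    return mean, 1.0 / precision


# Hypothetical log2 fold-change estimates for one gene from two studies.
studies = [(1.2, 0.25), (0.8, 0.25)]
combined_mean, combined_var = combine_estimates(studies)

# The combined variance is smaller than that of either study alone,
# which is the gain in statistical power referred to above.
assert combined_var < min(var for _, var in studies)
```

Only the summary estimates enter the calculation, so no cross-laboratory normalization of the raw intensities is needed in this simplified picture.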

Similar models can also be used to integrate heterogeneous data sources. A typical application, for example, is the combination of database information such as functional gene annotation or protein interaction measurements with microarray gene expression data. This increases the statistical power of model inference and thus helps overcome the limitations due to the small numbers of microarray samples available in most studies.

The above three application areas in computational biology suggest themselves as workshop modules. After introducing the application context and giving an overview of probabilistic approaches, we will therefore devote one session to applications of Bayesian experimental design, a second session to the inference of common processes from multiple experiments, and a third session to the probabilistic integration of heterogeneous data sources.

  Vienna Science and Technology Fund (WWTF)     Boku Bioinformatics