CHAPTER 2
MATHEMATICAL MODELS AND COMPUTATIONAL METHODS FOR INFERENCE OF GENETIC NETWORKS

Tatsuya Akutsu

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Japan

2.1 INTRODUCTION

Genes maintain organisms by interacting with each other through messenger ribonucleic acids (mRNAs), proteins, and other types of molecules. Interactions among genes are often represented as networks, which are called gene regulatory networks, or genetic networks in short. Genetic networks are usually represented as directed graphs, in which nodes correspond to genes and edges correspond to regulatory relationships between two genes.

Deciphering genetic networks is important for understanding complex cellular systems because they play important roles in cells through control of protein production. In order to infer genetic networks, various kinds of data have been used such as gene expression profiles (particularly mRNA expression profiles), CHromatin ImmunoPrecipitation-chip data for transcription binding information, deoxyribonucleic acid (DNA)–protein interaction data, and mRNA seq data generated by using the next-generation DNA sequencing technology. In this chapter, we focus on inference of genetic networks from gene expression time series data (see Figure 2.1) because many studies have been done based on this kind of data and many of the developed methods and techniques may be applied to mRNA seq data.

Block Observed data on left has rectangular blocks v1(t)- v3(t) with fluctuating curves. Arrow inference leads to block genetic network with six points v1-v6, arrows connecting them.

Figure 2.1 Inference of a genetic network from gene expression time series data.

Various mathematical models have been applied and/or developed to infer genetic networks from gene expression profiles, which include Boolean networks, Bayesian networks, dynamic Bayesian networks, linear and nonlinear differential equation models, and graphical Gaussian models (GGMs). Since too many models and methods have been developed and applied, it is impossible to review all or many of them. Therefore, in this chapter, we focus on basic and simple models and methods, and try to explain basic ideas/concepts in them. Readers interested in more advanced models and methods are referred to more comprehensive review articles [6, 18, 23, 48].

2.2 BOOLEAN NETWORKS

A Boolean network is a discrete model of genetic networks [22]. It is one of the simplest models of genetic networks and was proposed in 1960s [21].

A Boolean network G(V, F) consists of a set V = {v1, …, vn} of nodes and a list F = (f1, …, fn) of Boolean functions, where we use n to denote the number of nodes throughout this chapter. Each node corresponds to a gene and takes either 0 (gene is expressed) and 1 (gene is not expressed) at each discrete time t. The state of node vi at time t is denoted by vi(t), and the states of all nodes change synchronously according to given regulation rules.1 A Boolean function with inputs from specified nodes is assigned to each node, where it represents a regulation rule for vi and the (ordered) set of input nodes (parent nodes) is denoted by pa(vi). The state of node vi at time t + 1 is determined by

numbered Display Equation

The vector consisting of the states of all nodes v(t) = (v1(t), …, vn(t)) is called a (global) state of a Boolean network at time t. We also write vi(t + 1) = fi(v(t)) to denote the regulation rule for vi and v(t + 1) = f(v(t)) to denote the regulation rule for the whole network.

Diagram has three circles v1, v2, v3. Arrows connect v1 to v2, v3; v2 to v3; v3 to v1, v2. Three equations beside - v1(t+1) = v3(t), v2(t+1) = v1(t) ᴧ v3̅(̅t̅)̅, v3(t+1) = v1(t) ᴧ  v2̅(̅t̅)̅.

Figure 2.2 Example of a Boolean network.

The structure of a Boolean network can be represented by a directed graph G(V, E), where E is the set of edges defined by . An edge from to vi means that directly affects the expression of vi. The number of input nodes to vi (i.e., |pa(vi)|) is called the indegree of vi and we use K to denote the maximum indegree of a Boolean network.

Figure 2.2 shows an example of a Boolean network. In this example, the state of node v1 at time t + 1 is determined by the state of node v3 at time t, the state of node v2 at time t + 1 is determined by the logical AND of the state of v1 and the negation (i.e., logical NOT) of the state of v3 at time t, and the state of node v3 at time t + 1 is determined by AND of the state of node v1 and NOT of the state of node v2 at time t, where we use xy and to denote logical AND of x and y, and logical NOT of x, respectively.

The dynamics of a Boolean network can be well described by a state transition diagram (see Figure 2.3 for an example). In this diagram, each node corresponds to a global state and each edge corresponds to a transition of global states from time t to t + 1. For example, an edge from 110 to 010 in Figure 2.3 means that if the global state is [1, 1, 0] at time t, the global state becomes [0, 1, 0] at time t + 1. It is also seen that if v(0) = [1, 1, 0], the global state changes as

numbered Display Equation

and the same global state [0, 0, 0] is repeated after t = 1. It is also seen if v(0) = [0, 0, 1], [1, 0, 0], and [0, 1, 1] are repeated alternatively after t = 0. Sets of repeating states are called attractors. Each attractor corresponds to a directed cycle in a state transition table, and the number of elements in an attractor is called the period of the attractor. An attractor with period 1 is called a singleton attractor, whereas an attractor with period greater than 1 is called a cyclic attractor. In Figure 2.3, we can see two singleton attractors {[1, 0, 1]} and {[0, 0, 0]}, and one cyclic attractor {[1, 0, 0], [0, 1, 1]} with period 2. Because of an interpretation that attractors correspond to types of cells, extensive studies have been done on distribution of attractors [10, 43] although no conclusive result has yet obtained. Many studies have also been done to efficiently detect attractors [11, 20, 32]. However, it is known that detection of an attractor is non-deterministic polynomial-time hard (NP-hard) [1].

Boolean network diagram has three flowcharts- on left block 101 leads to itself; in middle block 110 leads to 010 to 000 to itself; on left 111, 001 leads to 100 to 011 and back to 100.

Figure 2.3 State transition diagram for the Boolean network in Figure 2.2.

The inference problem of a Boolean network from time series gene expression data is defined as follows. Let (Ij, Oj) (j = 1, …, m) be a pair of expression profiles (i.e., 0–1 vectors) of v1, …, vn, where Ij corresponds to a global state at time t and Oj corresponds to a global state at time t + 1. Iji (resp. Oji) denotes the expression (0 or 1) of gene vi in Ij (resp. Oj). Each pair (Ij, Oj) is called a sample. We say that a Boolean function fi assigned to node vi is consistent with a sample (Ij, Oj) if holds, and G(V, F) is consistent with (Ij, Oj) if all Boolean functions in F are consistent with (Ij, Oj). For a set of samples EX = {(I1, O1), (I2, O2), …, (Im, Om)}, we say that G(V, F) (resp., node vi) is consistent with EX if G(V, F) (resp., node vi) is consistent with all (Ij, Oj) for 1 inline j inline m. Then, the inference problem is defined as follows: given the number of nodes n and a set of samples EX = {(Ij, Oj) | j = 1, …, m}, find a Boolean network with n nodes that is consistent with EX, where “none” should be reported if it does not exist. For example, suppose that the following set of samples are given:

numbered Display Equation

Then, the following functions are consistent with EX:

numbered Display Equation

whereas the following functions are inconsistent with EX:

numbered Display Equation

In general, this inference problem is NP-hard [15]. However, if the maximum indegree K is a constant (e.g., 2 or 3), this problem can be solved in polynomial time by exhaustive search [2]. Liang et al. developed a faster algorithm by making use of mutual information [29]. Fukagawa and Akutsu showed that a simple greedy-type algorithm can correctly identify a Boolean network for most cases with high probability if samples are given uniformly at random [15].

Akutsu et al. studied the sample complexity in order to uniquely identify a Boolean network (i.e., how many samples are needed so that there exists at most only one Boolean network consistent with the samples) and showed that O(K22K + 1log n) samples are enough on the average under the assumption that each sample (i.e., Ij) is generated uniformly at random [2], where the constant factor was improved by Perkins and Hallett [39]. Since 2n samples are required (i.e., the whole state transition diagram should be given) if there is no restriction on K, this result suggests that only a small part of the state transition diagram is enough for identification of a Boolean network if K is a constant. It should also be noted that the assumption of uniform randomness does not hold for real-time series data and thus more samples are required in practice.

Recently, a new approach based on semi-tensor product has been proposed for analysis of Boolean networks, and has been applied to various problems, which include stability analysis, identification, and control [4].

2.3 PROBABILISTIC BOOLEAN NETWORK

In a Boolean network, the global state at time t + 1 is determined uniquely from the global state at time t. However, real genetic networks may not necessarily work deterministically. Therefore, it is more realistic to assume that genetic networks work stochastically by means of the effects of noise and elements other than genes. In order to introduce such effects, Shmulevich et al. proposed the probabilistic Boolean network (PBN) [46, 47].2

A PBN is an extension of a Boolean network. The difference between two models lies only on that multiple Boolean functions fij(j = 1, 2, …, l(i)) are assigned to each node vi and one of them is randomly chosen at each time step according to the probability assigned to each function. At each time step, fij is chosen with pre-defined probability cij ⩾ 0, where cij should satisfy ∑l(i)j = 1cij = 1 for each i = 1, 2, …, n. Let fj be the jth possible realization of a Boolean network:

numbered Display Equation

where N = ∏ni = 1l(i) is the maximum possible number of different realizations of Boolean networks. The probability of choosing such a realization is given by

numbered Display Equation

since the selection of the Boolean function for each node is independent.

Figure 2.4 shows an example of PBN. Suppose that the global state of PBN at time t is [0, 0, 0]. If (f11f21f31) is selected with probability 0.8 × 0.7 = 0.56, the global state at time t + 1 remains [0, 0, 0]. Similarly, if (f11f21f32) is selected with probability 0.8 × 0.3 = 0.24, the global state at time t + 1 remains [0, 0, 0]. On the other hand, if (f12f21f31) is selected with probability 0.2 × 0.7 = 0.14 or (f12f21f32) is selected with probability 0.2 × 0.3 = 0.06, the global state at time t + 1 becomes [1, 0, 0]. Therefore, we have the following transition probabilities:

Diagram has three circles v1, v2, v3. Arrows connect v1 to v2, v3; v2 to v3; v3 to v1, v2. Table beside has two columns - Boolean function, Probability; three rows - f1
      1, f2
      1; f1
      2; f1
      3, f2
      3.

Figure 2.4 Example of a probabilistic Boolean network.

numbered Display Equation

where the probabilities of the transitions from [0, 0, 0] to the other states are 0. For another example, the (nonzero) transition probabilities from [0, 1, 0] are as follows:

numbered Display Equation

Transitions of global states in a Boolean network were represented by a state transition diagram. In a PBN, transitions are well represented by using 2n × 2n matrix A. For each 0–1 vector a = [a1, …, an] of size n, we assign an integer number id(a) between 1 and 2n by id(a) = 2n − 1a1 + 2n − 2a2 + ⋅⋅⋅ + 2an − 1 + an + 1. Then, A is defined by

numbered Display Equation

where i = id(b) and j = id(a). For example, the transition matrix of the PBN of Figure 2.4 is as follows:

numbered Display Equation

The first column of this matrix represents the transition probabilities from [0, 0, 0]; the second column represents the transition probabilities from [0, 0, 1]; and so on. Since this matrix corresponds to a state transition matrix in a standard Markov chain (with 2n states), the techniques developed in the field of Markov chain can be applied to PBNs.

In order to infer a PBN from gene expression time series data, we need to determine Boolean functions assigned to each node and the corresponding probabilities. For inference of Boolean functions, exhaustive search or heuristic methods are usually used. For determination of probabilities, the coefficient of determination is widely used, which is a measure of the relative decrease in error from estimating the state of vi via fij rather than by just the best constant estimate. Many studies have been done on simulation of biological systems using PBNs and on finding control strategies for biological systems. For details of inference, simulation, and control studies on PBNs, see a recent review in Ref. [53].

2.4 BAYESIAN NETWORK

The Bayesian network is a statistical model, and has been developed and studied in the fields of statistics and artificial intelligence. A Bayesian network is usually defined by a directed acyclic graph and conditional probabilities.

Let G(V, E) be a directed acyclic graph with a set of nodes V and a set of directed edges E, where acyclic means that there is no directed cycle (i.e., there is no path beginning and ending with the same node). As in Section 2.2, pa(vi) denotes the ordered set of input nodes to vi. In the rest of this section, we use Xi to denote both the ith node (i.e., vi) and random variable assigned to this node. In Figure 2.5(a), pa(X1) = pa(X2) = ∅ and pa(X3) = (X1, X2). In the basic version, each Xi takes discrete values. For example, we consider a case in which each Xi takes either 0 or 1. Then, logical AND (i.e., X3 = X1X2) can be represented by the following conditional probabilities:

numbered Display Equation
Bayesian network has four diagrams a-d, each with three circles X1, X2 on top, X3 at bottom. In a arrows from X1, X2 leads to X3; in b X1 to X3 to X2; in c X3 to X1, X2; in d X1 to X2, X3 and X3 to X2.

Figure 2.5 Examples of Bayesian networks.

We also denote by P(Xi | pa(Xi)) the conditional probability of Xi given pa(Xi). Using conditional probabilities, the probability of the whole network is given by

numbered Display Equation

For each of the networks given in Figure 2.5, this probability is given as follows:

  1. P(X1, X2, X3) = P(X3 | X1, X2)P(X1)P(X2),
  2. P(X1, X2, X3) = P(X3 | X2)P(X2 | X1)P(X1),
  3. P(X1, X2, X3) = P(X1 | X3)P(X2 | X3)P(X3),
  4. P(X1, X2, X3) = P(X2 | X1, X3)P(X3 | X1)P(X1).

For inference of Bayesian network structure G = G(V, E) from observed data D, we use the following relation:

numbered Display Equation

In order to calculate P(D | G), a family of probability distributions characterized by a set of parameters θ is employed. Then, P(D | G) is given by

numbered Display Equation

Although computation of this score is very difficult in general, it can be given by a simple formula in some special but useful cases [7].

Then, the inference problem is to find G maximizing P(D | G) (under the assumption that the prior probability P(G) is the same for all G). However, this problem is proved to be NP-hard [5]. Therefore, such heuristic algorithms as greedy algorithms and simulated annealing have been utilized. However, it is reported that an optimal solution can be obtained if n is small (e.g., n inline 30) [38]. In the case of application of Bayesian networks to inference of genetic networks, the domain consisting of three states, { − 1, 0, 1}, has been used [13], where 0, − 1, and 1 mean the neutral, down-regulated, and up-regulated levels, respectively.

Various extensions and modifications of Bayesian networks have been developed and applied for inference of genetic networks. Although we have considered discrete values for variables, continuous values can also be used. For example, Imoto et al. developed a method to infer Bayesian networks with continuous variables by combining nonparametric regression and information criteria [19].

Another important variant of Bayesian networks is the dynamic Bayesian network [40]. In dynamic Bayesian networks, random variable Xi depends on time and is represented as Xi(t). Furthermore, the state of Xi(t + 1) depends on states of pa(Xi)(t) (states of pa(Xi) at time t). Therefore, dynamic Bayesian networks are defined via the conditional probability,

numbered Display Equation

Different from usual Bayesian networks, loops are allowed in dynamic Bayesian networks. However, time series data are necessary to infer dynamic Bayesian networks although usual Bayesian networks can be inferred from static data (e.g., data obtained by gene knockout and overexpression experiments).

If we consider the binary domain (i.e., each variable takes 0 or 1), dynamic Bayesian networks are almost the same as PBNs. This relationship is discussed in Ref. [26]. Dynamic Bayesian networks have been further extended to time-delayed Bayesian networks in which Xi(t + 1) depends not only on pa(Xi)(t) but also on pa(Xi)(t − 1), …, pa(Xi)(th), where h is some positive integer, and have been applied to inference of genetic networks [30].

2.5 GRAPHICAL GAUSSIAN MODELING

A graphical Gaussian model is a statistical model based on conditional probabilities and Gaussian distributions. A GGM is represented with an undirected graph in which each edge represents the pairwise correlation between two nodes conditioned against the correlations with all other nodes. In the GGM approach, indirect association effects are eliminated by evaluating conditional dependencies in multivariate Gaussian distributions (see also Figure 2.6). It is to be noted that undirected networks are inferred using the GGM approach, different from Boolean and Bayesian network approaches.

As discussed in Sections 2.2 and 2.3, we assume that gene expression time series data are given as vi(t) (i = 1, …, n, t = 1, …, m). However, different from Boolean and Bayesian models, we assume that vi(t) takes real values. The standard Pearson correlation coefficient rij between vi and vj is defined as

numbered Display Equation
Two diagrams have five circles v1-v5, solid lines connect v1 to v2; v2 to v5, v3; v5 to v4. Dotted lines from v1 to v5, v4, v3 in left diagram. Arrow use of partial correlations in between.

Figure 2.6 Inference of a genetic network using partial correlations. Edges (shown by dotted lines) with small partial correlations are (repeatedly) eliminated in the graphical Gaussian model.

where represents the mean value of vi(t) (i.e., ). It is to be noted that is the standard deviation of vi(t). Let R be the correlation matrix consisting of rijs (i.e., R = (rij)) and let R− 1 = (rij) be the inverse matrix of R. Then, the partial correlation coefficient pij is defined as

numbered Display Equation

Let P be the partial correlation matrix consisting of pijs (i.e., P = (pij)), where the partial correlation means the pairwise correlation between two nodes conditioned against the correlations with all other nodes. Therefore, if there is no direct correlation between nodes vi and vj, pij = 0 holds (under a linear Gaussian model). However, in practice, we cannot expect that pij = 0 holds exactly even if there is no direct correlation. Therefore, we delete edge(s) whose partial correlation is minimum or less than some threshold. Furthermore, we can apply computation of partial correlation and deletion of edges repeatedly, where R for the next step is computed from the modified P in which 0 is assigned to pijs corresponding to deleted edges.

GGMs have been applied to inference of genetic networks by many groups [8, 25, 45, 52, 55]. Other methods for eliminating indirect edges have also been developed and applied. For example, mutual information [31] and entropy maximization [28] have been applied for that purpose.

Recently, new approaches for eliminating indirect correlations have been proposed: the silencing method [3] and the network deconvolution method [12]. The silencing method is based on a dynamical model. In this method, the global response matrix,

numbered Display Equation

and the local response matrix,

numbered Display Equation

are used, where we use vi to denote both a node and its expression value. G captures the change of vi’s activity in response to changes of vj’s activity, and can be obtained from gene knockout and overexpression experiments. It is to be noted that G includes the effects of indirect correlations whereas S consists of only direct correlations. Therefore, G and S correspond to observed correlations and direct correlations, respectively. These two matrices are related by

numbered Display Equation

which leads to a matrix-based method to compute S from G.

In the network deconvolution method, it is assumed that the true network is given in the form of n × n matrix , and the observed network has the form of

numbered Display Equation

where I is the identity matrix. Then, can be computed from by

numbered Display Equation

It is to be noted that

numbered Display Equation

corresponds to indirect correlations.

In the above-mentioned methods, only elimination of spurious edges/correlations is considered. However, it may also be useful to modify a known network by addition and deletion of edges so that the resulting network is more consistent with observed data. This approach is called network completion (see Figure 2.7) and various methods have been proposed although the target networks do not necessarily include genetic networks. Guimerà and Sales-Pardo developed a method that can identify both missing and spurious interactions by using the stochastic block models [16]. Hanneke and Xing [17] developed a method based on random sampling of subnetworks and analyzed confidence intervals from samples [8]. Kim and Leskovec developed the KronEM algorithm by combining the expectation–maximization (EM) method with the Kronecker graphs [24]. Nakajima et al. developed a network completion method by combining dynamic programming and least-squared fitting [36].

Two diagrams - initial, completed network have central circle with five circles around connected by arrows. Block time series data points to arrow in between for deleted, added edge.

Figure 2.7 Illustration of network completion.

2.6 DIFFERENTIAL EQUATIONS

Differential equations have been widely used in many areas in science and engineering. They have also been used for describing dynamics of various biological systems. Therefore, it is reasonable to use differential equations for inference of genetic networks, and many such studies have been done.

First, we consider inference methods based on linear differential equations. Let V = {v1, …, vn} be a set of nodes (i.e., a set of genes). Let vi(t) denote the expression level of gene vi at time t. Then, we assume that vi(t)s (i = 1, …, n) are determined by the following differential equations:

numbered Display Equation

where ai, js are parameters. We relate these parameters to the network structure by the correspondence that there exists a directed edge (vj, vi) if and only if ai, j ≠ 0. In practice, “ai, j ≠ 0” should be replaced by “|ai, j| > δ” for some positive constant δ. Furthermore, since only gene expression data for discrete time steps are available, we approximate the differential equations with difference equations:

numbered Display Equation

where Δ is the unit time (i.e., period between two consecutive time steps). The inference problem based on these difference equations is defined as follows: given gene expression time series data vi(t) for i = 1, …, n and t = 0, Δ, 2Δ, …, (m − 1)Δ, infer parameters ai, j for i = 1, …, n, j = 0, …, n that satisfy the above difference equations.

Then, the difference equations give a system of linear equations in which vi(t)s are constants (since they are given from gene expression data) and ai, js are variables to be determined. However, we cannot expect that this system has a unique solution because the number of equations n(m − 1) is not necessarily equal to the number of variables n(n + 1). In practice, the number of time points m is much smaller than the number of genes n. Furthermore, we need to consider the effect of observation and system noises. In order to cope with these problems, linear regression, precisely, linear regression using least-squares fitting, has been applied [9, 49]. In this approach, we find a set of parameters, ai, js, that minimizes the sum of least-squared errors:

numbered Display Equation

This minimization problem can be solved by using matrix computations.

Although the above-mentioned approach is easy to implement, it may cause overfitting owing to unbalance between the number of data and the number of variables. One way to cope with the overfitting problem is to introduce a regularization term in the objective function, where L1 and L2 regularization terms have widely been used [18]. For example, in the case of L1 regularization (least absolute shrinkage and selection operator), the objective function is modified as

numbered Display Equation

where λ is a parameter to be adjusted (usually by trial and error). It is also useful to incorporate with information criteria such as Akaikes information criteria and Bayesian information criteria [37].

Next, we consider inference based on nonlinear differential equations. Of course, various types of nonlinear differential equations have been applied and thus it is impossible to introduce all or many of them. Therefore, we only introduce the S-system [44] because extensive studies have been done on inference of S-systems. In the S-system modeling, differential equations have the following form:

numbered Display Equation

where αi, βi, gi, j, and hi, j (i = 1, …, n, j = 1, …, n) are parameters to be estimated from time series data vi(t). Again, we approximate these differential equations by difference equations. Different from the linear case, no simple and efficient method is known for determining parameters for this model (and most of other nonlinear models). Therefore, use of evolutionary computation methods is a good choice. Details of such methods are explained in Chapter 5 (by Kimura) and Chapter 8 (by Chowdhury and Chetty) of this book along with other types of nonlinear differential equations.

In addition to differential equation-based modeling, inference methods based on mass-action modeling have also been proposed. In this approach, each biochemical reaction R is represented in the following form:

numbered Display Equation

where {S1, …, SN} denotes a set of different molecular species occurring in R and ai, bi are non-negative integer parameters called the stoichiometric coefficients. Nobel et al. developed a method for inference of genetic networks based on mass-action modeling by combining Cartesian genetic programming and particle swarm optimization. They also discuss an extension of the method to support stochastic simulation algorithms like the Gillespie algorithm. For details of mass-action modeling based approaches, see Chapter 6 (by Nobile et al.).

Expression versus time ranging 1-13 graph has three fluctuating curves for gene 1-3. Arrow inference from graph point to three blocks below linked by arrow change at 5, 10 on time axis.

Figure 2.8 Illustration of inference of time-varying networks. In this problem, it is required to identify both change points and network structures.

2.7 TIME-VARYING NETWORK

We have seen various models and methods for inference of genetic networks. All the reviewed models and methods assume that the topology of the network does not change through time. However, the real gene regulatory network in a cell might dynamically change its structure depending on the effects of certain shocks, cell division, and so on. Therefore, it is also important to develop mathematical models and methods for inferring time-varying genetic networks based on time series gene expression data (see Figure 2.8). In this problem, it is required to infer both change time points and network structures between two consecutive change time points (including the starting and end time points).

For this problem, various models and methods have been proposed. For example, Yoshida et al. developed a dynamic linear model with Markov switching that represents change points in regimes, which evolve according to a first-order Markov process [54]. Robinson and Hartemink proposed a nonstationary dynamic Bayesian network, based on dynamic Bayesian networks [42]. Fujita et al. proposed a method based on the dynamic autoregressive model by extending the vector autoregression model [14]. Lèbre et al. developed another autoregressive time-varying algorithm by adopting a combination of reversible jump Markov chain Monte Carlo and dynamic Bayesian networks [27]. Thorne and Stumpf developed a method based on modeling of the regulatory network structure between distinct segments with a set of hidden states by applying the hierarchical Dirichlet process and hidden Markov model [51]. Rassol and Bouaynaya developed a method based on constrained and smoothed Kalman filtering [41]. Nakajima and Akutsu extended a network completion method based on dynamic programming and least-squares fitting [36] for time-varying networks [35].

2.8 CONCLUSION

In this chapter, we have briefly reviewed mathematical models and computational methods for inference of genetic networks from gene expression time series data. Since too many methods have been proposed, we could only review basic and simple models and methods. Although we have focused on inference from gene expression data only, integration of various types of data can also be useful for inference of genetic networks, and various methods have been proposed [18] (see also Chapter 3 (by Zhang)).

It is quite difficult to assess which inference method is the best (or better). In order to assess inference methods, we need to know the true genetic networks or gold standard networks. However, it is quite difficult to determine true genetic networks by biological experiments because genes are not directly regulated; they are regulated in a complex way via mRNA, proteins, and other factors (e.g., epigenetic factors). This situation is very different from metabolic networks and protein–protein interaction networks although it is still difficult to determine true protein–protein interactions by biological experiments.

For protein structure prediction, Critical Assessment of protein Structure Prediction (CASP) has been playing an important role in assessment and development of prediction methods [34]. The success of CASP may be due to that accurate structures can be determined by X-ray and NMR experiments. In order to assess inference and analysis methods for gene expression data, DREAM (Dialogue for Reverse Engineering Assessments and Methods) [33] and CAMDA (Critical Assessment of Techniques for Microarray Data Analysis) [50] have been organized, where CAMDA changed its name and scope to Critical Assessment of Massive Data Analysis. Different from CASP, it is difficult to provide the benchmark data and assess the results. Therefore, these projects provide various types of data, which include real gene expression data and simulated gene expression data. It is to be noted that the gold standards are available to assess the prediction results when simulated gene expression data are used. For example, in a recent challenge [34], two subchallenges were proposed: one for parameter estimation for given topology and biochemical structure of a 9-gene regulatory network, and the other for prediction of three missing links for an incomplete topology with 11 genes. An interesting point of this recent challenge is that participants are allowed to (virtually) buy several kinds but limited amount of experimental data that are generated in silico. The benchmark data provided in these projects are becoming useful to develop and evaluate new inference and analysis methods.

NOTES

REFERENCES

  1. T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano. A system for identifying genetic networks from gene expression patterns produced by gene disruptions and overexpressions. Genome Informatics, 9:151–160, 1998.
  2. T. Akutsu, S. Miyano, and S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (eds.). Proceedings of Pacific Symposium on Biocomputing 1999, World Scientific, Singapore, pp. 17–28, 1999.
  3. B. Barzel and A.-L. Barabási. Network link prediction by global silencing of indirect correlations. Nature Biotechnology, 31:720–725, 2013.
  4. D. Cheng, H. Qi, and Z. Li. Analysis and Control of Boolean Networks. A Semi-Tensor Product Approach. Springer, Heidelberg, 2011.
  5. D. M. Chickering. Learning Bayesian networks is NP-complete. In: D. Fisher and H.-J. Lenz (eds). Learning from Data: Artificial Intelligence and Statistics V, Springer, Heidelberg, 1996.
  6. K.-H. Cho, S.-M. Choo, S. H. Jung, J.-R. Kim, H.-S. Choi, and J. Kim. Reverse engineering of gene regulatory networks. IET Systems Biology, 1:149–163, 2007.
  7. G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
  8. A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes. Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics, 20:3565–3574, 2004.
  9. P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Linear modeling of mRNA expression levels during CNS development and injury. In: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (eds.). Proceedings of Pacific Symposium on Biocomputing 1999, World Scientific, Singapore, pp. 41–52, 1999.
  10. B. Drossel, T. Mihaljev, and F. Greil. Number and length of attractors in a critical Kauffman model with connectivity one. Physical Review Letters, 94:088701, 2005.
  11. E. Dubrova and M. Teslenko. A SAT-based algorithm for finding attractors in synchronous Boolean networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8:1393–1398, 2011.
  12. S. Feizi, D. Marbach, M. Médard, and M. Kellis. Network deconvolution as a general method to distinguish direct dependencies in networks. Nature Biotechnology, 31:726–733, 2013.
  13. N. Friedman, N. Linial, I. Nachman, and D. Pe’er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601–620, 2000.
  14. A. Fujita, J. R. Sato, H. M. Garay-Malpartida, P. A. Morettin, M. C. Sogayar, and C. E. Ferreira. Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics, 23:1623–1630, 2007.
  15. D. Fukagawa and T. Akutsu. Performance analysis of a greedy algorithm for inferring Boolean functions. Information Processing Letters, 93:7–12, 2005.
  16. R. Guimerà and M. Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences of the United States of America, 106:22073–22078, 2009.
  17. S. Hanneke and E. P. Xing. Network completion and survey sampling. Journal of Machine Learning Research, 5:209–215, 2009.
  18. M. Hecker, S. Lambeck, S. Toepfer, E. van Someren, and R. Guthke. Gene regulatory network inference: data integration in dynamic models—a review. Biosystems, 96:86–103, 2009.
  19. S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara, and S. Miyano. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Journal of Bioinformatics and Computational Biology, 2:77–98, 2004.
  20. D. J. Irons. Improving the efficiency of attractor cycle identification in Boolean networks. Physica D, 217:7–21, 2006.
  21. S. A. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22:437–467, 1969.
  22. S. A. Kauffman. The Origins of Order: Self-organization and Selection in Evolution, Oxford University Press, New York, 1993.
  23. G. Karlebach and R. Shamir. Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology, 9:770–780, 2008.
  24. M. Kim and J. Leskovec. The network completion problem: inferring missing nodes and edges in networks. In: B. Liu, H. Liu, C. Clifton, T. Washio, C. Kamath (eds.). Proceedings of the 2011 SIAM International Conference on Data Mining, SIAM, Philadelphia, pp. 47–58, 2011.
  25. H. Kishino and P. J. Waddell. Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Informatics, 11:83–95, 2000.
  26. H. Lähdesmäki, S. Hautaniemi, I. Shmulevich, and O. Yli-Harja. Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Processing, 86:814–834, 2006.
  27. S. Lèbre, J. Becq, F. Devaux, M. P. H. Stumpf, and G. Lelandais. Statistical inference of the time-varying structure of gene-regulation networks. BMC Systems Biology, 4:130, 2010.
  28. T. R. Lezon, J. R. Banavar, M. Cieplak, A. Maritan, and N. V. Fedoroff. Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 103:19033–19038, 2006.
  29. S. Liang, S. Fuhrman, and R. Somogyi. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (eds.). Proceedings of Pacific Symposium on Biocomputing 1998, World Scientific, Singapore, pp. 18–29, 1998.
  30. T.-F. Liu, W.-K. Sung, and A. Mittal. Learning gene network using time-delayed Bayesian framework. International Journal on Artificial Intelligence Tools, 15:353–370, 2006.
  31. A. A. Margolin, K. Wang, W. K. Lim, M. Kustagi, I. Nemenman, and A. Califano. Reverse engineering cellular networks. Nature Protocols, 1:662–671, 2006.
  32. A. A. Melkman and T. Akutsu. An improved satisfiability algorithm for nested canalyzing functions and its application to determining a singleton attractor of a Boolean network. Journal of Computational Biology, 20:958–969, 2013.
  33. P. Meyer, et al. Network topology and parameter estimation: from experimental design methods to gene regulatory network kinetics using a community based approach. BMC Systems Biology, 8:13, 2014.
  34. J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, and A.Tramontano. Critical assessment of methods of protein structure prediction (CASP) round x. Proteins: Structure, Function, and Bioinformatics, 82(S2):16, 2014.
  35. N. Nakajima and T. Akutsu. Exact and heuristic methods for network completion for time-varying genetic networks. BioMed Research International, 2014:684014, 2014.
  36. N. Nakajima, T. Tamura, Y. Yamanishi, K. Horimoto, and T. Akutsu. Network completion using dynamic programming and least-squares fitting. The ScientificWorld Journal, 2012:957620, 2012.
  37. N. Noman, L. Palafox, and H. Iba. On model selection criteria in reverse engineering gene networks using RNN model. In: G. Lee, D. Howard, J. J. Kang, D. Ślinlinezak (eds.). Proceedings of the International Conference on Convergence and Hybrid Information Technology, Springer, Berlin, pp. 155–164, 2012.
  38. S. Ott, S. Imoto, and S. Miyano. Finding optimal models for small Gene networks. In: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (eds.). Proceedings of Pacific Symposium on Biocomputing 2004, World Scientific, Singapore, pp. 557–567, 2004.
  39. T. J. Perkins and M. J. Hallett. A trade-off between sample complexity and computational complexity in learning Boolean networks from time-series data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7:118–125, 2010.
  40. B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. dAlchéBuc. Gene networks inference using dynamic Bayesian networks. Bioinformatics, 19:ii138–ii148, 2013.
  41. G. Rassol and N. Bouaynaya. Inference of time-varying gene networks using constrained and smoothed Kalman filtering. In: Proceedings of International Workshop on Genomic Signal Processing and Statistics, IEEE, Piscataway, NJ, pp. 172–175, 2012.
  42. J. Robinson and A. Hartemink. Non-stationary dynamic Bayesian networks. In: D. Koller, D. Schuurmans, Y. Bengio, L. Bottou (eds.). Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, Curran Associates, Inc., Red Hook, NY, pp. 1369–1376, 2008.
  43. B. Samuelsson and C. Troein. Superpolynomial growth in the number of attractors in Kauffman networks. Physical Review Letters, 90:098701, 2003.
  44. M. Savageau. Biochemical Systems Analysis. A Study of Function and Design in Molecular Biology. Addison-Wesley, Massachusetts, 1976.
  45. J. Schäfer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21:754–764, 2005.
  46. I. Shmulevich and E. R. Dougherty. Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. SIAM, Philadelphia, 2010.
  47. I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18:261–274, 2002.
  48. C. Sima, J. Hua, and S. Jung. Inference of gene regulatory networks using time-series data: a survey. Current Genomics, 10:416–429, 2009.
  49. E. P. van Someren, L. F. A. Wessels, and M. J. T. Reinders. Linear modeling of genetic networks from experimental data. In: P. E. Bourne, M. Gribskov, R. B. Altman, N. Jensen, D. A. Hope, T. Lengauer, J. C. Mitchell, E. D. Scheeff, C. Smith, S. Strande, H. Weissig (eds.). Proceeding of the 18th International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Palo Alto, CA, pp. 355–366, 2000.
  50. C. Telstone. Vital statistics. Nature, 424:610–512, 2003.
  51. T. Thorne and M. P. H. Stumpf. Inference of temporally varying Bayesian Networks. Bioinformatics, 28:3298–3305, 2012.
  52. H. Toh and K. Horimoto. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics, 18:287–297, 2002.
  53. R. Trairatphisan, A. Mizera, J. Pang, A. A. Tantar, H. Schneider, and T. Sauter. Recent development and biomedical applications of probabilistic Boolean networks. Cell Communication and Signaling, 11:46, 2013.
  54. R. Yoshida, S. Imoto, and T. Higuchi. Estimating time-dependent gene networks from time series microarray data by dynamic linear models with Markov switching. In: V. Markstein (ed). Proceedings of 4th Computational Systems Bioinformatics, IEEE, Piscataway, NJ, pp. 289–298. 2005.
  55. Y. Yuan, C-T. Li, and O. Windram. Directed partial correlation: inferring large-scale gene regulatory network through induced topology disruptions. PLoS One, 6:e16835, 2011.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset