Tatsuya Akutsu
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Japan
Genes maintain organisms by interacting with each other through messenger ribonucleic acids (mRNAs), proteins, and other types of molecules. Interactions among genes are often represented as networks, which are called gene regulatory networks, or genetic networks for short. Genetic networks are usually represented as directed graphs, in which nodes correspond to genes and edges correspond to regulatory relationships between two genes.
Deciphering genetic networks is important for understanding complex cellular systems because genetic networks play important roles in cells through control of protein production. In order to infer genetic networks, various kinds of data have been used, such as gene expression profiles (particularly mRNA expression profiles), chromatin immunoprecipitation (ChIP)-chip data for transcription factor binding information, deoxyribonucleic acid (DNA)–protein interaction data, and mRNA-seq data generated using next-generation DNA sequencing technology. In this chapter, we focus on inference of genetic networks from gene expression time series data (see Figure 2.1) because many studies have been done based on this kind of data and many of the developed methods and techniques may be applied to mRNA-seq data.
Various mathematical models have been applied and/or developed to infer genetic networks from gene expression profiles, including Boolean networks, Bayesian networks, dynamic Bayesian networks, linear and nonlinear differential equation models, and graphical Gaussian models (GGMs). Since so many models and methods have been developed and applied, it is impossible to review all or even many of them. Therefore, in this chapter, we focus on basic and simple models and methods, and try to explain the basic ideas and concepts behind them. Readers interested in more advanced models and methods are referred to more comprehensive review articles [6, 18, 23, 48].
A Boolean network is a discrete model of genetic networks [22]. It is one of the simplest models of genetic networks and was proposed in the 1960s [21].
A Boolean network G(V, F) consists of a set V = {v1, …, vn} of nodes and a list F = (f1, …, fn) of Boolean functions, where we use n to denote the number of nodes throughout this chapter. Each node corresponds to a gene and takes either 1 (gene is expressed) or 0 (gene is not expressed) at each discrete time t. The state of node vi at time t is denoted by vi(t), and the states of all nodes change synchronously according to given regulation rules. A Boolean function with inputs from specified nodes is assigned to each node, where it represents a regulation rule for vi and the (ordered) set of input nodes (parent nodes) is denoted by pa(vi). The state of node vi at time t + 1 is determined by
$$v_i(t+1) = f_i(\mathrm{pa}(v_i)(t)),$$
where pa(vi)(t) denotes the states of the nodes in pa(vi) at time t.
The vector consisting of the states of all nodes v(t) = (v1(t), …, vn(t)) is called a (global) state of a Boolean network at time t. We also write vi(t + 1) = fi(v(t)) to denote the regulation rule for vi and v(t + 1) = f(v(t)) to denote the regulation rule for the whole network.
The structure of a Boolean network can be represented by a directed graph G(V, E), where E is the set of edges defined by E = {(vj, vi) | vj ∈ pa(vi)}. An edge from vj to vi means that vj directly affects the expression of vi. The number of input nodes to vi (i.e., |pa(vi)|) is called the indegree of vi, and we use K to denote the maximum indegree of a Boolean network.
Figure 2.2 shows an example of a Boolean network. In this example, the state of node v1 at time t + 1 is determined by the state of node v3 at time t; the state of node v2 at time t + 1 is determined by the logical AND of the state of v1 and the negation (i.e., logical NOT) of the state of v3 at time t; and the state of node v3 at time t + 1 is determined by the AND of the state of node v1 and the NOT of the state of node v2 at time t, where we use x ∧ y and ¬x to denote the logical AND of x and y and the logical NOT of x, respectively. That is, v1(t + 1) = v3(t), v2(t + 1) = v1(t) ∧ ¬v3(t), and v3(t + 1) = v1(t) ∧ ¬v2(t).
The dynamics of a Boolean network can be well described by a state transition diagram (see Figure 2.3 for an example). In this diagram, each node corresponds to a global state and each edge corresponds to a transition of global states from time t to t + 1. For example, an edge from 110 to 010 in Figure 2.3 means that if the global state is [1, 1, 0] at time t, the global state becomes [0, 1, 0] at time t + 1. It is also seen that if v(0) = [1, 1, 0], the global state changes as
$$[1, 1, 0] \rightarrow [0, 1, 0] \rightarrow [0, 0, 0] \rightarrow [0, 0, 0] \rightarrow \cdots,$$
and the same global state [0, 0, 0] is repeated after t = 1. It is also seen that if v(0) = [0, 0, 1], the states [1, 0, 0] and [0, 1, 1] are repeated alternately after t = 0. Sets of repeating states are called attractors. Each attractor corresponds to a directed cycle in the state transition diagram, and the number of elements in an attractor is called the period of the attractor. An attractor with period 1 is called a singleton attractor, whereas an attractor with period greater than 1 is called a cyclic attractor. In Figure 2.3, we can see two singleton attractors, {[1, 0, 1]} and {[0, 0, 0]}, and one cyclic attractor, {[1, 0, 0], [0, 1, 1]}, with period 2. Because of the interpretation that attractors correspond to types of cells, extensive studies have been done on the distribution of attractors [10, 43], although no conclusive result has yet been obtained. Many studies have also been done to efficiently detect attractors [11, 20, 32]. However, it is known that detection of an attractor is non-deterministic polynomial-time hard (NP-hard) [1].
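As an illustration, the transition map and attractors of the network in Figure 2.2 can be enumerated by brute force over all 2^n global states (the function names below are our own; the exhaustive search is exponential in n, consistent with the NP-hardness of attractor detection):

```python
# Boolean network of Figure 2.2:
#   v1(t+1) = v3(t),  v2(t+1) = v1(t) AND NOT v3(t),  v3(t+1) = v1(t) AND NOT v2(t)
def step(state):
    """One synchronous update of the global state (v1, v2, v3)."""
    v1, v2, v3 = state
    return (v3, v1 & (1 - v3), v1 & (1 - v2))

def find_attractors():
    """Follow transitions from every global state until a state repeats;
    the repeating cycle is an attractor (exponential in n in general)."""
    attractors = set()
    for s in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
        seen = []
        while s not in seen:
            seen.append(s)
            s = step(s)
        cycle = seen[seen.index(s):]        # the states that repeat forever
        i = cycle.index(min(cycle))         # canonical rotation for set membership
        attractors.add(tuple(cycle[i:] + cycle[:i]))
    return attractors
```

Running `find_attractors()` on this network returns exactly the two singleton attractors and the cyclic attractor of period 2 described above.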
The inference problem of a Boolean network from time series gene expression data is defined as follows. Let (Ij, Oj) (j = 1, …, m) be pairs of expression profiles (i.e., 0–1 vectors) of v1, …, vn, where Ij corresponds to a global state at time t and Oj corresponds to a global state at time t + 1. Ij,i (resp., Oj,i) denotes the expression (0 or 1) of gene vi in Ij (resp., Oj). Each pair (Ij, Oj) is called a sample. We say that a Boolean function fi assigned to node vi is consistent with a sample (Ij, Oj) if fi(Ij) = Oj,i holds (where fi is applied to the states in Ij of the nodes in pa(vi)), and G(V, F) is consistent with (Ij, Oj) if all Boolean functions in F are consistent with (Ij, Oj). For a set of samples EX = {(I1, O1), (I2, O2), …, (Im, Om)}, we say that G(V, F) (resp., node vi) is consistent with EX if G(V, F) (resp., node vi) is consistent with all (Ij, Oj) for 1 ≤ j ≤ m. Then, the inference problem is defined as follows: given the number of nodes n and a set of samples EX = {(Ij, Oj) | j = 1, …, m}, find a Boolean network with n nodes that is consistent with EX, where “none” should be reported if no such network exists. For example, suppose that the following set of samples is given:
Then, the following functions are consistent with EX:
whereas the following functions are inconsistent with EX:
In general, this inference problem is NP-hard [15]. However, if the maximum indegree K is a constant (e.g., 2 or 3), this problem can be solved in polynomial time by exhaustive search [2]. Liang et al. developed a faster algorithm by making use of mutual information [29]. Fukagawa and Akutsu showed that a simple greedy-type algorithm can correctly identify a Boolean network for most cases with high probability if samples are given uniformly at random [15].
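The polynomial-time exhaustive search for constant K can be sketched as follows: for each node, every parent set of size at most K and every truth table over it are tested against the samples. This is a naive sketch of the idea behind the approach in [2], not the literal algorithm; for fixed K there are O(n^K) parent sets and at most 2^(2^K) truth tables per set:

```python
from itertools import combinations, product

def consistent_rules(samples, n, K):
    """For each node i, enumerate every (parent set, truth table) pair of
    indegree <= K that is consistent with all samples (I_j, O_j).
    For constant K this is polynomial in n: O(n^K) parent sets and at
    most 2^(2^K) truth tables per set."""
    def row(I, parents):                    # truth-table row selected by input I
        x = 0
        for p in parents:
            x = 2 * x + I[p]
        return x
    result = []
    for i in range(n):
        rules = [(parents, table)
                 for k in range(K + 1)
                 for parents in combinations(range(n), k)
                 for table in product((0, 1), repeat=2 ** k)
                 if all(table[row(I, parents)] == O[i] for I, O in samples)]
        result.append(rules)                # an empty list here would mean "none"
    return result
```

With few samples, many rules remain consistent for each node; as samples accumulate, the consistent sets shrink, which is exactly the sample-complexity question discussed next.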
Akutsu et al. studied the sample complexity required to uniquely identify a Boolean network (i.e., how many samples are needed so that there exists at most one Boolean network consistent with the samples) and showed that $O(K \cdot 2^{2K+1} \log n)$ samples are enough on average under the assumption that each sample (i.e., Ij) is generated uniformly at random [2], where the constant factor was improved by Perkins and Hallett [39]. Since $2^n$ samples are required (i.e., the whole state transition diagram should be given) if there is no restriction on K, this result suggests that only a small part of the state transition diagram is enough for identification of a Boolean network if K is a constant. It should also be noted that the assumption of uniform randomness does not hold for real time series data, and thus more samples are required in practice.
Recently, a new approach based on semi-tensor product has been proposed for analysis of Boolean networks, and has been applied to various problems, which include stability analysis, identification, and control [4].
In a Boolean network, the global state at time t + 1 is determined uniquely from the global state at time t. However, real genetic networks do not necessarily work deterministically. Therefore, it is more realistic to assume that genetic networks work stochastically owing to the effects of noise and of elements other than genes. In order to introduce such effects, Shmulevich et al. proposed the probabilistic Boolean network (PBN) [46, 47].
A PBN is an extension of a Boolean network. The only difference between the two models is that multiple Boolean functions fij (j = 1, 2, …, l(i)) are assigned to each node vi, and one of them is randomly chosen at each time step according to the probability assigned to each function. At each time step, fij is chosen with pre-defined probability cij ⩾ 0, where the cijs should satisfy $\sum_{j=1}^{l(i)} c_{ij} = 1$ for each i = 1, 2, …, n. Let fj be the jth possible realization of a Boolean network:
$$f_j = (f_{1j_1}, f_{2j_2}, \ldots, f_{nj_n}), \quad 1 \le j_i \le l(i),$$
where $N = \prod_{i=1}^{n} l(i)$ is the maximum possible number of different realizations of Boolean networks. The probability of choosing such a realization is given by
$$P(f_j) = \prod_{i=1}^{n} c_{ij_i},$$
since the selection of the Boolean function for each node is independent.
Figure 2.4 shows an example of a PBN. Suppose that the global state of the PBN at time t is [0, 0, 0]. If (f11, f21, f31) is selected, which happens with probability 0.8 × 0.7 = 0.56 (note that f21 is the only function assigned to v2 and is thus chosen with probability 1), the global state at time t + 1 remains [0, 0, 0]. Similarly, if (f11, f21, f32) is selected with probability 0.8 × 0.3 = 0.24, the global state at time t + 1 remains [0, 0, 0]. On the other hand, if (f12, f21, f31) is selected with probability 0.2 × 0.7 = 0.14 or (f12, f21, f32) is selected with probability 0.2 × 0.3 = 0.06, the global state at time t + 1 becomes [1, 0, 0]. Therefore, we have the following transition probabilities:
$$P([0,0,0] \to [0,0,0]) = 0.56 + 0.24 = 0.8, \qquad P([0,0,0] \to [1,0,0]) = 0.14 + 0.06 = 0.2,$$
where the probabilities of the transitions from [0, 0, 0] to the other states are 0. For another example, the (nonzero) transition probabilities from [0, 1, 0] are as follows:
Transitions of global states in a Boolean network were represented by a state transition diagram. In a PBN, transitions are well represented by using a $2^n \times 2^n$ matrix A. For each 0–1 vector a = [a1, …, an] of size n, we assign an integer number id(a) between 1 and $2^n$ by $id(a) = 2^{n-1} a_1 + 2^{n-2} a_2 + \cdots + 2 a_{n-1} + a_n + 1$. Then, A is defined by
$$A_{i,j} = P(a \to b),$$
where i = id(b) and j = id(a). For example, the transition matrix of the PBN of Figure 2.4 is as follows:
The first column of this matrix represents the transition probabilities from [0, 0, 0]; the second column represents the transition probabilities from [0, 0, 1]; and so on. Since this matrix corresponds to a state transition matrix in a standard Markov chain (with $2^n$ states), the techniques developed in the field of Markov chains can be applied to PBNs.
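Given the lists of Boolean functions and their selection probabilities, the matrix A can be assembled by enumerating all realizations. The small two-node PBN used below is a hypothetical example, not the network of Figure 2.4, and for programming convenience states are indexed from 0 (i.e., id(a) − 1):

```python
import numpy as np
from itertools import product

def transition_matrix(funcs):
    """Build the 2^n x 2^n PBN transition matrix A, where A[id(b), id(a)]
    is the probability of moving from global state a to state b (states
    are indexed from 0 here, i.e., id(a) = binary value of a).
    funcs[i] is a list of (boolean_function, probability) pairs for node i."""
    n = len(funcs)
    A = np.zeros((2 ** n, 2 ** n))
    for a in product((0, 1), repeat=n):
        col = int("".join(map(str, a)), 2)
        # enumerate every realization: one function choice per node
        for choice in product(*[range(len(fs)) for fs in funcs]):
            p = 1.0
            b = []
            for i, j in enumerate(choice):
                f, c = funcs[i][j]
                p *= c
                b.append(f(a))
            A[int("".join(map(str, b)), 2), col] += p
    return A
```

Each column of the resulting matrix sums to 1, as required of a Markov chain transition matrix.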
In order to infer a PBN from gene expression time series data, we need to determine Boolean functions assigned to each node and the corresponding probabilities. For inference of Boolean functions, exhaustive search or heuristic methods are usually used. For determination of probabilities, the coefficient of determination is widely used, which is a measure of the relative decrease in error from estimating the state of vi via fij rather than by just the best constant estimate. Many studies have been done on simulation of biological systems using PBNs and on finding control strategies for biological systems. For details of inference, simulation, and control studies on PBNs, see a recent review in Ref. [53].
The Bayesian network is a statistical model, and has been developed and studied in the fields of statistics and artificial intelligence. A Bayesian network is usually defined by a directed acyclic graph and conditional probabilities.
Let G(V, E) be a directed acyclic graph with a set of nodes V and a set of directed edges E, where acyclic means that there is no directed cycle (i.e., there is no directed path beginning and ending with the same node). As in Section 2.2, pa(vi) denotes the ordered set of input nodes to vi. In the rest of this section, we use Xi to denote both the ith node (i.e., vi) and the random variable assigned to this node. In Figure 2.5(a), pa(X1) = pa(X2) = ∅ and pa(X3) = (X1, X2). In the basic version, each Xi takes discrete values. For example, we consider a case in which each Xi takes either 0 or 1. Then, logical AND (i.e., X3 = X1 ∧ X2) can be represented by the following conditional probabilities:
$$P(X_3 = 1 \mid X_1 = 1, X_2 = 1) = 1, \qquad P(X_3 = 1 \mid X_1, X_2) = 0 \ \text{otherwise}.$$
We also denote by P(Xi | pa(Xi)) the conditional probability of Xi given pa(Xi). Using conditional probabilities, the probability of the whole network is given by
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)).$$
For each of the networks given in Figure 2.5, this probability is given as follows:
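For the AND network of Figure 2.5(a), the factorization reads P(X1, X2, X3) = P(X1) P(X2) P(X3 | X1, X2). A minimal sketch follows, where the priors P(X1 = 1) = P(X2 = 1) = 0.5 are our own hypothetical choice (the text does not specify them):

```python
def joint(x1, x2, x3, p1=0.5, p2=0.5):
    """P(X1, X2, X3) = P(X1) P(X2) P(X3 | X1, X2) for the AND network of
    Figure 2.5(a). The priors p1 = P(X1 = 1) and p2 = P(X2 = 1) are
    hypothetical choices, not values from the text."""
    P1 = p1 if x1 == 1 else 1 - p1
    P2 = p2 if x2 == 1 else 1 - p2
    P3 = 1.0 if x3 == (x1 & x2) else 0.0   # deterministic AND as a conditional probability
    return P1 * P2 * P3
```

Summing `joint` over all eight assignments yields 1, confirming that the factorization defines a proper distribution.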
For inference of the Bayesian network structure G = G(V, E) from observed data D, we use the following relation (Bayes' theorem):
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}.$$
In order to calculate P(D | G), a family of probability distributions characterized by a set of parameters θ is employed. Then, P(D | G) is given by
$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, \mathrm{d}\theta.$$
Although computation of this score is very difficult in general, it can be given by a simple formula in some special but useful cases [7].
Then, the inference problem is to find G maximizing P(D | G) (under the assumption that the prior probability P(G) is the same for all G). However, this problem has been proved to be NP-hard [5]. Therefore, heuristic algorithms such as greedy algorithms and simulated annealing have been utilized. Nevertheless, it is reported that an optimal solution can be obtained if n is small (e.g., n ≤ 30) [38]. In the case of application of Bayesian networks to inference of genetic networks, a domain consisting of three states, {−1, 0, 1}, has been used [13], where 0, −1, and 1 mean the neutral, down-regulated, and up-regulated levels, respectively.
Various extensions and modifications of Bayesian networks have been developed and applied for inference of genetic networks. Although we have considered discrete values for variables, continuous values can also be used. For example, Imoto et al. developed a method to infer Bayesian networks with continuous variables by combining nonparametric regression and information criteria [19].
Another important variant of Bayesian networks is the dynamic Bayesian network [40]. In dynamic Bayesian networks, the random variable Xi depends on time and is represented as Xi(t). Furthermore, the state of Xi(t + 1) depends on the states of pa(Xi)(t) (the states of pa(Xi) at time t). Therefore, dynamic Bayesian networks are defined via the conditional probability
$$P(X_i(t+1) \mid \mathrm{pa}(X_i)(t)).$$
Unlike usual Bayesian networks, dynamic Bayesian networks allow loops. However, time series data are necessary to infer dynamic Bayesian networks, whereas usual Bayesian networks can be inferred from static data (e.g., data obtained by gene knockout and overexpression experiments).
If we consider the binary domain (i.e., each variable takes 0 or 1), dynamic Bayesian networks are almost the same as PBNs. This relationship is discussed in Ref. [26]. Dynamic Bayesian networks have been further extended to time-delayed Bayesian networks in which Xi(t + 1) depends not only on pa(Xi)(t) but also on pa(Xi)(t − 1), …, pa(Xi)(t − h), where h is some positive integer, and have been applied to inference of genetic networks [30].
A graphical Gaussian model is a statistical model based on conditional probabilities and Gaussian distributions. A GGM is represented as an undirected graph in which each edge represents the pairwise correlation between two nodes conditioned against the correlations with all other nodes. In the GGM approach, indirect association effects are eliminated by evaluating conditional dependencies in multivariate Gaussian distributions (see also Figure 2.6). It is to be noted that the GGM approach infers undirected networks, unlike the Boolean and Bayesian network approaches.
As discussed in Sections 2.2 and 2.3, we assume that gene expression time series data are given as vi(t) (i = 1, …, n, t = 1, …, m). However, different from the Boolean and Bayesian models, we assume that vi(t) takes real values. The standard Pearson correlation coefficient rij between vi and vj is defined as
$$r_{ij} = \frac{1}{m} \sum_{t=1}^{m} \frac{(v_i(t) - \bar{v}_i)(v_j(t) - \bar{v}_j)}{\sigma_i \sigma_j},$$
where $\bar{v}_i$ represents the mean value of vi(t) (i.e., $\bar{v}_i = \frac{1}{m}\sum_{t=1}^{m} v_i(t)$). It is to be noted that $\sigma_i$ is the standard deviation of vi(t). Let R be the correlation matrix consisting of the rijs (i.e., R = (rij)) and let $R^{-1} = (r^{ij})$ be the inverse matrix of R. Then, the partial correlation coefficient pij is defined as
$$p_{ij} = -\frac{r^{ij}}{\sqrt{r^{ii}\, r^{jj}}}.$$
Let P be the partial correlation matrix consisting of pijs (i.e., P = (pij)), where the partial correlation means the pairwise correlation between two nodes conditioned against the correlations with all other nodes. Therefore, if there is no direct correlation between nodes vi and vj, pij = 0 holds (under a linear Gaussian model). However, in practice, we cannot expect that pij = 0 holds exactly even if there is no direct correlation. Therefore, we delete edge(s) whose partial correlation is minimum or less than some threshold. Furthermore, we can apply computation of partial correlation and deletion of edges repeatedly, where R for the next step is computed from the modified P in which 0 is assigned to pijs corresponding to deleted edges.
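The partial correlation matrix can be computed directly from the inverse of the Pearson correlation matrix via p_ij = −r^{ij}/√(r^{ii} r^{jj}); a minimal sketch:

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix P from a data matrix X whose rows are
    time points and whose columns are genes:
    p_ij = -r^{ij} / sqrt(r^{ii} r^{jj}), with (r^{ij}) = R^{-1}."""
    R = np.corrcoef(X, rowvar=False)       # Pearson correlation matrix
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    P = -Rinv / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P
```

For a chain v1 → v2 → v3, the marginal correlation between v1 and v3 is large, but their partial correlation is close to 0, so thresholding removes the indirect edge while keeping the direct ones.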
GGMs have been applied to inference of genetic networks by many groups [8, 25, 45, 52, 55]. Other methods for eliminating indirect edges have also been developed and applied. For example, mutual information [31] and entropy maximization [28] have been applied for that purpose.
Recently, new approaches for eliminating indirect correlations have been proposed: the silencing method [3] and the network deconvolution method [12]. The silencing method is based on a dynamical model. In this method, the global response matrix,
$$G_{ij} = \frac{\mathrm{d}v_i}{\mathrm{d}v_j},$$
and the local response matrix,
$$S_{ij} = \frac{\partial v_i}{\partial v_j},$$
are used, where we use vi to denote both a node and its expression value. G captures the change of vi's activity in response to changes of vj's activity, and can be obtained from gene knockout and overexpression experiments. It is to be noted that G includes the effects of indirect correlations, whereas S consists of only direct correlations. Therefore, G and S correspond to observed correlations and direct correlations, respectively. These two matrices are related by
$$S = \left( G - I + D\left( (G - I)\, G \right) \right) G^{-1},$$
where I is the identity matrix and D(M) denotes the matrix obtained from M by setting all off-diagonal elements to 0, which leads to a matrix-based method to compute S from G.
In the network deconvolution method, it is assumed that the true (direct) network is given in the form of an n × n matrix $G_{\mathrm{dir}}$, and the observed network has the form of
$$G_{\mathrm{obs}} = G_{\mathrm{dir}} + G_{\mathrm{dir}}^2 + G_{\mathrm{dir}}^3 + \cdots = G_{\mathrm{dir}} (I - G_{\mathrm{dir}})^{-1},$$
where I is the identity matrix. Then, $G_{\mathrm{dir}}$ can be computed from $G_{\mathrm{obs}}$ by
$$G_{\mathrm{dir}} = G_{\mathrm{obs}} (I + G_{\mathrm{obs}})^{-1}.$$
It is to be noted that
$$G_{\mathrm{obs}} - G_{\mathrm{dir}} = G_{\mathrm{dir}}^2 + G_{\mathrm{dir}}^3 + \cdots$$
corresponds to indirect correlations.
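Under the series interpretation of network deconvolution (the observed matrix is the sum of the direct matrix and all of its higher powers), the direct matrix has a closed form; a minimal sketch (the practical method of Ref. [12] additionally works in an eigenbasis with scaling, which is omitted here):

```python
import numpy as np

def deconvolve(G_obs):
    """Recover the direct matrix from
    G_obs = G_dir + G_dir^2 + ... = G_dir (I - G_dir)^{-1},
    which gives G_dir = G_obs (I + G_obs)^{-1} in closed form
    (assumes the series converges and the inverse exists)."""
    I = np.eye(G_obs.shape[0])
    return G_obs @ np.linalg.inv(I + G_obs)
```

Applying `deconvolve` to an observed matrix built from a known direct matrix via the series recovers the direct matrix exactly.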
In the above-mentioned methods, only elimination of spurious edges/correlations is considered. However, it may also be useful to modify a known network by addition and deletion of edges so that the resulting network is more consistent with observed data. This approach is called network completion (see Figure 2.7), and various methods have been proposed, although the target networks do not necessarily include genetic networks. Guimerà and Sales-Pardo developed a method that can identify both missing and spurious interactions by using stochastic block models [16]. Hanneke and Xing [17] developed a method based on random sampling of subnetworks and analyzed confidence intervals from the samples [8]. Kim and Leskovec developed the KronEM algorithm by combining the expectation–maximization (EM) method with Kronecker graphs [24]. Nakajima et al. developed a network completion method by combining dynamic programming and least-squares fitting [36].
Differential equations have been widely used in many areas in science and engineering. They have also been used for describing dynamics of various biological systems. Therefore, it is reasonable to use differential equations for inference of genetic networks, and many such studies have been done.
First, we consider inference methods based on linear differential equations. Let V = {v1, …, vn} be a set of nodes (i.e., a set of genes). Let vi(t) denote the expression level of gene vi at time t. Then, we assume that the vi(t)s (i = 1, …, n) are determined by the following differential equations:
$$\frac{\mathrm{d}v_i(t)}{\mathrm{d}t} = a_{i,0} + \sum_{j=1}^{n} a_{i,j}\, v_j(t),$$
where the $a_{i,j}$s are parameters. We relate these parameters to the network structure by the correspondence that there exists a directed edge (vj, vi) if and only if $a_{i,j} \neq 0$ ($j \geq 1$). In practice, “$a_{i,j} \neq 0$” should be replaced by “$|a_{i,j}| > \delta$” for some positive constant δ. Furthermore, since only gene expression data for discrete time steps are available, we approximate the differential equations with difference equations:
$$v_i(t + \Delta) = v_i(t) + \left( a_{i,0} + \sum_{j=1}^{n} a_{i,j}\, v_j(t) \right) \Delta,$$
where Δ is the unit time (i.e., period between two consecutive time steps). The inference problem based on these difference equations is defined as follows: given gene expression time series data vi(t) for i = 1, …, n and t = 0, Δ, 2Δ, …, (m − 1)Δ, infer parameters ai, j for i = 1, …, n, j = 0, …, n that satisfy the above difference equations.
Then, the difference equations give a system of linear equations in which the vi(t)s are constants (since they are given from the gene expression data) and the $a_{i,j}$s are variables to be determined. However, we cannot expect that this system has a unique solution because the number of equations, n(m − 1), is not necessarily equal to the number of variables, n(n + 1). In practice, the number of time points m is much smaller than the number of genes n. Furthermore, we need to consider the effect of observation and system noise. In order to cope with these problems, linear regression (more precisely, linear regression using least-squares fitting) has been applied [9, 49]. In this approach, we find a set of parameters $a_{i,j}$ that minimizes the sum of squared errors:
$$\sum_{i=1}^{n} \sum_{k=0}^{m-2} \left( v_i((k+1)\Delta) - v_i(k\Delta) - \left( a_{i,0} + \sum_{j=1}^{n} a_{i,j}\, v_j(k\Delta) \right) \Delta \right)^2.$$
This minimization problem can be solved by using matrix computations.
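The least-squares estimation reduces to one linear regression per gene with regressors (1, v1(t), …, vn(t)); a minimal sketch of the matrix computation (variable names are our own):

```python
import numpy as np

def fit_linear_model(V, dt):
    """Least-squares estimation of a_{i,0}, ..., a_{i,n} in
    v_i(t + dt) = v_i(t) + dt * (a_{i,0} + sum_j a_{i,j} v_j(t)).
    V is an m x n array; row t holds v_1(t), ..., v_n(t).
    Returns an n x (n+1) array whose row i is (a_{i,0}, a_{i,1}, ..., a_{i,n})."""
    m = V.shape[0]
    X = np.hstack([np.ones((m - 1, 1)), V[:-1]])   # regressors (1, v(t))
    Y = (V[1:] - V[:-1]) / dt                      # responses (v(t+dt) - v(t)) / dt
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A.T
```

With noise-free data generated by the difference equations themselves, the parameters are recovered up to numerical precision, provided the trajectory excites all regressors; with noisy data, the estimates minimize the residual sum of squares.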
Although the above-mentioned approach is easy to implement, it may cause overfitting owing to the imbalance between the number of data points and the number of variables. One way to cope with the overfitting problem is to introduce a regularization term into the objective function, where L1 and L2 regularization terms have been widely used [18]. For example, in the case of L1 regularization (the least absolute shrinkage and selection operator, lasso), the objective function is modified by adding the penalty term
$$\lambda \sum_{i=1}^{n} \sum_{j=0}^{n} |a_{i,j}|$$
to the sum of squared errors,
where λ is a parameter to be adjusted (usually by trial and error). It is also useful to incorporate information criteria such as Akaike's information criterion and the Bayesian information criterion [37].
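The L1-penalized objective can be minimized, for example, by iterative soft-thresholding (ISTA); the following is a minimal proximal-gradient sketch for a single gene (a generic lasso solver of our own, not a method from the cited references; in practice, packaged solvers are typically used):

```python
import numpy as np

def lasso_ista(X, y, lam, iters=5000):
    """Minimize ||X w - y||^2 + lam * ||w||_1 for one gene by iterative
    soft-thresholding (ISTA); a minimal sketch, not a tuned solver."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L, L = Lipschitz const. of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - step * 2 * X.T @ (X @ w - y)       # gradient step on the squared error
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft threshold
    return w
```

The soft-threshold step drives small coefficients exactly to zero, which is why the L1 penalty yields sparse networks.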
Next, we consider inference based on nonlinear differential equations. Of course, various types of nonlinear differential equations have been applied, and thus it is impossible to introduce all or even many of them. Therefore, we only introduce the S-system [44] because extensive studies have been done on inference of S-systems. In S-system modeling, the differential equations have the following form:
$$\frac{\mathrm{d}v_i(t)}{\mathrm{d}t} = \alpha_i \prod_{j=1}^{n} v_j(t)^{g_{i,j}} - \beta_i \prod_{j=1}^{n} v_j(t)^{h_{i,j}},$$
where αi, βi, gi,j, and hi,j (i = 1, …, n, j = 1, …, n) are parameters to be estimated from the time series data vi(t). Again, we approximate these differential equations by difference equations. Unlike the linear case, no simple and efficient method is known for determining the parameters of this model (or of most other nonlinear models). Therefore, the use of evolutionary computation methods is a good choice. Details of such methods are explained in Chapter 5 (by Kimura) and Chapter 8 (by Chowdhury and Chetty) of this book, along with other types of nonlinear differential equations.
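For completeness, the forward problem — simulating an S-system by the Euler difference approximation mentioned above — can be sketched as follows (the parameter values used in any experiment are hypothetical):

```python
import numpy as np

def simulate_s_system(alpha, beta, G, H, v0, dt, steps):
    """Euler (difference-equation) simulation of an S-system:
    dv_i/dt = alpha_i * prod_j v_j^{g_ij} - beta_i * prod_j v_j^{h_ij}.
    Returns a (steps+1) x n array of expression levels."""
    V = [np.asarray(v0, dtype=float)]
    for _ in range(steps):
        v = V[-1]
        production = alpha * np.prod(v ** G, axis=1)
        degradation = beta * np.prod(v ** H, axis=1)
        V.append(v + dt * (production - degradation))
    return np.array(V)
```

Inference methods repeatedly run such simulations for candidate parameter sets and score them against the observed time series, which is the costly inner loop that evolutionary methods must handle.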
In addition to differential equation-based modeling, inference methods based on mass-action modeling have also been proposed. In this approach, each biochemical reaction R is represented in the following form:
$$a_1 S_1 + a_2 S_2 + \cdots + a_N S_N \longrightarrow b_1 S_1 + b_2 S_2 + \cdots + b_N S_N,$$
where {S1, …, SN} denotes a set of different molecular species occurring in R, and the ai, bi are non-negative integer parameters called the stoichiometric coefficients. Nobile et al. developed a method for inference of genetic networks based on mass-action modeling by combining Cartesian genetic programming and particle swarm optimization. They also discuss an extension of the method to support stochastic simulation algorithms such as the Gillespie algorithm. For details of mass-action modeling based approaches, see Chapter 6 (by Nobile et al.).
We have seen various models and methods for inference of genetic networks. All the reviewed models and methods assume that the topology of the network does not change through time. However, the real gene regulatory network in a cell might dynamically change its structure depending on the effects of certain shocks, cell division, and so on. Therefore, it is also important to develop mathematical models and methods for inferring time-varying genetic networks based on time series gene expression data (see Figure 2.8). In this problem, it is required to infer both change time points and network structures between two consecutive change time points (including the starting and end time points).
For this problem, various models and methods have been proposed. For example, Yoshida et al. developed a dynamic linear model with Markov switching that represents change points in regimes, which evolve according to a first-order Markov process [54]. Robinson and Hartemink proposed a nonstationary dynamic Bayesian network, based on dynamic Bayesian networks [42]. Fujita et al. proposed a method based on the dynamic autoregressive model by extending the vector autoregression model [14]. Lèbre et al. developed another autoregressive time-varying algorithm by adopting a combination of reversible jump Markov chain Monte Carlo and dynamic Bayesian networks [27]. Thorne and Stumpf developed a method based on modeling of the regulatory network structure between distinct segments with a set of hidden states by applying the hierarchical Dirichlet process and hidden Markov model [51]. Rasool and Bouaynaya developed a method based on constrained and smoothed Kalman filtering [41]. Nakajima and Akutsu extended a network completion method based on dynamic programming and least-squares fitting [36] to time-varying networks [35].
In this chapter, we have briefly reviewed mathematical models and computational methods for inference of genetic networks from gene expression time series data. Since so many methods have been proposed, we could review only basic and simple models and methods. Although we have focused on inference from gene expression data only, integration of various types of data can also be useful for inference of genetic networks, and various such methods have been proposed [18] (see also Chapter 3 (by Zhang)).
It is quite difficult to assess which inference method is the best (or better than others). In order to assess inference methods, we need to know the true genetic networks or gold standard networks. However, it is quite difficult to determine true genetic networks by biological experiments because genes do not regulate each other directly; rather, regulation occurs in a complex way via mRNAs, proteins, and other factors (e.g., epigenetic factors). This situation is very different from that of metabolic networks and protein–protein interaction networks, although it is still difficult to determine true protein–protein interactions by biological experiments.
For protein structure prediction, the Critical Assessment of protein Structure Prediction (CASP) has been playing an important role in the assessment and development of prediction methods [34]. The success of CASP may be due to the fact that accurate structures can be determined by X-ray and NMR experiments. In order to assess inference and analysis methods for gene expression data, DREAM (Dialogue for Reverse Engineering Assessments and Methods) [33] and CAMDA (Critical Assessment of Techniques for Microarray Data Analysis) [50] have been organized, where CAMDA has since changed its name and scope to Critical Assessment of Massive Data Analysis. Unlike in CASP, however, it is difficult to provide benchmark data and to assess the results. Therefore, these projects provide various types of data, including real gene expression data and simulated gene expression data. It is to be noted that gold standards are available for assessing the prediction results when simulated gene expression data are used. For example, in a recent challenge [34], two subchallenges were proposed: one for parameter estimation given the topology and biochemical structure of a 9-gene regulatory network, and the other for prediction of three missing links in an incomplete topology with 11 genes. An interesting point of this recent challenge is that participants were allowed to (virtually) buy several kinds, but only limited amounts, of experimental data generated in silico. The benchmark data provided by these projects are becoming useful for developing and evaluating new inference and analysis methods.