Tatsuya Akutsu
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Japan
Genes maintain organisms by interacting with each other through messenger ribonucleic acids (mRNAs), proteins, and other types of molecules. Interactions among genes are often represented as networks, which are called gene regulatory networks, or genetic networks for short. Genetic networks are usually represented as directed graphs, in which nodes correspond to genes and edges correspond to regulatory relationships between two genes.
Deciphering genetic networks is important for understanding complex cellular systems because genetic networks play important roles in cells through control of protein production. In order to infer genetic networks, various kinds of data have been used, such as gene expression profiles (particularly mRNA expression profiles), chromatin immunoprecipitation (ChIP)-chip data for transcription factor binding information, deoxyribonucleic acid (DNA)–protein interaction data, and mRNA-seq data generated using next-generation DNA sequencing technology. In this chapter, we focus on inference of genetic networks from gene expression time series data (see Figure 2.1) because many studies have been done based on this kind of data and many of the developed methods and techniques may be applied to mRNA-seq data.
Various mathematical models have been applied and/or developed to infer genetic networks from gene expression profiles, including Boolean networks, Bayesian networks, dynamic Bayesian networks, linear and nonlinear differential equation models, and graphical Gaussian models (GGMs). Since so many models and methods have been developed and applied, it is impossible to review all or even many of them. Therefore, in this chapter, we focus on basic and simple models and methods, and try to explain the basic ideas and concepts behind them. Readers interested in more advanced models and methods are referred to more comprehensive review articles [6, 18, 23, 48].
A Boolean network is a discrete model of genetic networks [22]. It is one of the simplest models of genetic networks and was proposed in the 1960s [21].
A Boolean network G(V, F) consists of a set V = {v1, …, vn} of nodes and a list F = (f1, …, fn) of Boolean functions, where we use n to denote the number of nodes throughout this chapter. Each node corresponds to a gene and takes either 1 (gene is expressed) or 0 (gene is not expressed) at each discrete time t. The state of node vi at time t is denoted by vi(t), and the states of all nodes change synchronously according to given regulation rules. A Boolean function with inputs from specified nodes is assigned to each node, where it represents a regulation rule for vi and the (ordered) set of input nodes (parent nodes) is denoted by pa(vi). The state of node vi at time t + 1 is determined by
$$v_i(t+1) = f_i(\mathrm{pa}(v_i)(t)),$$
where pa(vi)(t) denotes the states of the nodes in pa(vi) at time t.
The vector consisting of the states of all nodes v(t) = (v1(t), …, vn(t)) is called a (global) state of a Boolean network at time t. We also write vi(t + 1) = fi(v(t)) to denote the regulation rule for vi and v(t + 1) = f(v(t)) to denote the regulation rule for the whole network.
The structure of a Boolean network can be represented by a directed graph G(V, E), where E is the set of edges defined by E = {(vj, vi) | vj ∈ pa(vi)}. An edge from vj to vi means that vj directly affects the expression of vi. The number of input nodes to vi (i.e., |pa(vi)|) is called the indegree of vi, and we use K to denote the maximum indegree of a Boolean network.
Figure 2.2 shows an example of a Boolean network. In this example, the state of node v1 at time t + 1 is determined by the state of node v3 at time t; the state of node v2 at time t + 1 is determined by the logical AND of the state of v1 and the negation (i.e., logical NOT) of the state of v3 at time t; and the state of node v3 at time t + 1 is determined by the AND of the state of node v1 and the NOT of the state of node v2 at time t, where we use x ∧ y and ¬x to denote the logical AND of x and y and the logical NOT of x, respectively. That is, v1(t + 1) = v3(t), v2(t + 1) = v1(t) ∧ ¬v3(t), and v3(t + 1) = v1(t) ∧ ¬v2(t).
The dynamics of a Boolean network can be well described by a state transition diagram (see Figure 2.3 for an example). In this diagram, each node corresponds to a global state and each edge corresponds to a transition of global states from time t to t + 1. For example, an edge from 110 to 010 in Figure 2.3 means that if the global state is [1, 1, 0] at time t, the global state becomes [0, 1, 0] at time t + 1. It is also seen that if v(0) = [1, 1, 0], the global state changes as
$$[1, 1, 0] \rightarrow [0, 1, 0] \rightarrow [0, 0, 0] \rightarrow [0, 0, 0] \rightarrow \cdots,$$
and the same global state [0, 0, 0] is repeated after t = 1. It is also seen that if v(0) = [0, 0, 1], the states [1, 0, 0] and [0, 1, 1] are repeated alternately after t = 0. Sets of repeating states are called attractors. Each attractor corresponds to a directed cycle in the state transition diagram, and the number of elements in an attractor is called the period of the attractor. An attractor with period 1 is called a singleton attractor, whereas an attractor with period greater than 1 is called a cyclic attractor. In Figure 2.3, we can see two singleton attractors, {[1, 0, 1]} and {[0, 0, 0]}, and one cyclic attractor, {[1, 0, 0], [0, 1, 1]}, with period 2. Because of the interpretation that attractors correspond to types of cells, extensive studies have been done on the distribution of attractors [10, 43], although no conclusive result has yet been obtained. Many studies have also been done to efficiently detect attractors [11, 20, 32]. However, it is known that detection of an attractor is non-deterministic polynomial-time hard (NP-hard) [1].
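As an illustration, the transition map and attractors of the network in Figure 2.2 can be enumerated by brute force over all 2^n global states (the function names below are our own; the exhaustive search is exponential in n, consistent with the NP-hardness of attractor detection):

```python
# Boolean network of Figure 2.2:
#   v1(t+1) = v3(t),  v2(t+1) = v1(t) AND NOT v3(t),  v3(t+1) = v1(t) AND NOT v2(t)
def step(state):
    """One synchronous update of the global state (v1, v2, v3)."""
    v1, v2, v3 = state
    return (v3, v1 & (1 - v3), v1 & (1 - v2))

def find_attractors():
    """Follow transitions from every global state until a state repeats;
    the repeating cycle is an attractor (exponential in n in general)."""
    attractors = set()
    for s in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
        seen = []
        while s not in seen:
            seen.append(s)
            s = step(s)
        cycle = seen[seen.index(s):]        # the states that repeat forever
        i = cycle.index(min(cycle))         # canonical rotation for set membership
        attractors.add(tuple(cycle[i:] + cycle[:i]))
    return attractors
```

Running `find_attractors()` on this network returns exactly the two singleton attractors and the cyclic attractor of period 2 described above.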
The inference problem of a Boolean network from time series gene expression data is defined as follows. Let (Ij, Oj) (j = 1, …, m) be pairs of expression profiles (i.e., 0–1 vectors) of v1, …, vn, where Ij corresponds to a global state at time t and Oj corresponds to a global state at time t + 1. Ij,i (resp., Oj,i) denotes the expression (0 or 1) of gene vi in Ij (resp., Oj). Each pair (Ij, Oj) is called a sample. We say that a Boolean function fi assigned to node vi is consistent with a sample (Ij, Oj) if fi(Ij) = Oj,i holds (where fi is applied to the states in Ij of the nodes in pa(vi)), and G(V, F) is consistent with (Ij, Oj) if all Boolean functions in F are consistent with (Ij, Oj). For a set of samples EX = {(I1, O1), (I2, O2), …, (Im, Om)}, we say that G(V, F) (resp., node vi) is consistent with EX if G(V, F) (resp., node vi) is consistent with all (Ij, Oj) for 1 ≤ j ≤ m. Then, the inference problem is defined as follows: given the number of nodes n and a set of samples EX = {(Ij, Oj) | j = 1, …, m}, find a Boolean network with n nodes that is consistent with EX, where “none” should be reported if no such network exists. For example, suppose that the following set of samples is given:
Then, the following functions are consistent with EX:
whereas the following functions are inconsistent with EX:
In general, this inference problem is NP-hard [15]. However, if the maximum indegree K is a constant (e.g., 2 or 3), this problem can be solved in polynomial time by exhaustive search [2]. Liang et al. developed a faster algorithm by making use of mutual information [29]. Fukagawa and Akutsu showed that a simple greedy-type algorithm can correctly identify a Boolean network for most cases with high probability if samples are given uniformly at random [15].
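The polynomial-time exhaustive search for constant K can be sketched as follows: for each node, every parent set of size at most K and every truth table over it are tested against the samples. This is a naive sketch of the idea behind the approach in [2], not the literal algorithm; for fixed K there are O(n^K) parent sets and at most 2^(2^K) truth tables per set:

```python
from itertools import combinations, product

def consistent_rules(samples, n, K):
    """For each node i, enumerate every (parent set, truth table) pair of
    indegree <= K that is consistent with all samples (I_j, O_j).
    For constant K this is polynomial in n: O(n^K) parent sets and at
    most 2^(2^K) truth tables per set."""
    def row(I, parents):                    # truth-table row selected by input I
        x = 0
        for p in parents:
            x = 2 * x + I[p]
        return x
    result = []
    for i in range(n):
        rules = [(parents, table)
                 for k in range(K + 1)
                 for parents in combinations(range(n), k)
                 for table in product((0, 1), repeat=2 ** k)
                 if all(table[row(I, parents)] == O[i] for I, O in samples)]
        result.append(rules)                # an empty list here would mean "none"
    return result
```

With few samples, many rules remain consistent for each node; as samples accumulate, the consistent sets shrink, which is exactly the sample-complexity question discussed next.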
Akutsu et al. studied the sample complexity required to uniquely identify a Boolean network (i.e., how many samples are needed so that there exists at most one Boolean network consistent with the samples) and showed that $O(K \cdot 2^{2K+1} \log n)$ samples are enough on average under the assumption that each sample (i.e., Ij) is generated uniformly at random [2], where the constant factor was improved by Perkins and Hallett [39]. Since $2^n$ samples are required (i.e., the whole state transition diagram should be given) if there is no restriction on K, this result suggests that only a small part of the state transition diagram is enough for identification of a Boolean network if K is a constant. It should also be noted that the assumption of uniform randomness does not hold for real time series data, and thus more samples are required in practice.
Recently, a new approach based on semi-tensor product has been proposed for analysis of Boolean networks, and has been applied to various problems, which include stability analysis, identification, and control [4].
In a Boolean network, the global state at time t + 1 is determined uniquely from the global state at time t. However, real genetic networks do not necessarily work deterministically. Therefore, it is more realistic to assume that genetic networks work stochastically owing to the effects of noise and of elements other than genes. In order to introduce such effects, Shmulevich et al. proposed the probabilistic Boolean network (PBN) [46, 47].
A PBN is an extension of a Boolean network. The only difference between the two models is that multiple Boolean functions fij (j = 1, 2, …, l(i)) are assigned to each node vi, and one of them is randomly chosen at each time step according to the probability assigned to each function. At each time step, fij is chosen with pre-defined probability cij ⩾ 0, where the cijs should satisfy $\sum_{j=1}^{l(i)} c_{ij} = 1$ for each i = 1, 2, …, n. Let fj be the jth possible realization of a Boolean network:
$$f_j = (f_{1j_1}, f_{2j_2}, \ldots, f_{nj_n}), \quad 1 \le j_i \le l(i),$$
where $N = \prod_{i=1}^{n} l(i)$ is the maximum possible number of different realizations of Boolean networks. The probability of choosing such a realization is given by
$$P(f_j) = \prod_{i=1}^{n} c_{ij_i},$$
since the selection of the Boolean function for each node is independent.
Figure 2.4 shows an example of a PBN. Suppose that the global state of the PBN at time t is [0, 0, 0]. If (f11, f21, f31) is selected, which happens with probability 0.8 × 0.7 = 0.56 (note that f21 is the only function assigned to v2 and is thus chosen with probability 1), the global state at time t + 1 remains [0, 0, 0]. Similarly, if (f11, f21, f32) is selected with probability 0.8 × 0.3 = 0.24, the global state at time t + 1 remains [0, 0, 0]. On the other hand, if (f12, f21, f31) is selected with probability 0.2 × 0.7 = 0.14 or (f12, f21, f32) is selected with probability 0.2 × 0.3 = 0.06, the global state at time t + 1 becomes [1, 0, 0]. Therefore, we have the following transition probabilities:
$$P([0,0,0] \to [0,0,0]) = 0.56 + 0.24 = 0.8, \qquad P([0,0,0] \to [1,0,0]) = 0.14 + 0.06 = 0.2,$$
where the probabilities of the transitions from [0, 0, 0] to the other states are 0. For another example, the (nonzero) transition probabilities from [0, 1, 0] are as follows:
Transitions of global states in a Boolean network were represented by a state transition diagram. In a PBN, transitions are well represented by using a $2^n \times 2^n$ matrix A. For each 0–1 vector a = [a1, …, an] of size n, we assign an integer number id(a) between 1 and $2^n$ by $id(a) = 2^{n-1} a_1 + 2^{n-2} a_2 + \cdots + 2 a_{n-1} + a_n + 1$. Then, A is defined by
$$A_{i,j} = P(a \to b),$$
where i = id(b) and j = id(a). For example, the transition matrix of the PBN of Figure 2.4 is as follows:
The first column of this matrix represents the transition probabilities from [0, 0, 0]; the second column represents the transition probabilities from [0, 0, 1]; and so on. Since this matrix corresponds to a state transition matrix in a standard Markov chain (with $2^n$ states), the techniques developed in the field of Markov chains can be applied to PBNs.
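Given the lists of Boolean functions and their selection probabilities, the matrix A can be assembled by enumerating all realizations. The small two-node PBN used below is a hypothetical example, not the network of Figure 2.4, and for programming convenience states are indexed from 0 (i.e., id(a) − 1):

```python
import numpy as np
from itertools import product

def transition_matrix(funcs):
    """Build the 2^n x 2^n PBN transition matrix A, where A[id(b), id(a)]
    is the probability of moving from global state a to state b (states
    are indexed from 0 here, i.e., id(a) = binary value of a).
    funcs[i] is a list of (boolean_function, probability) pairs for node i."""
    n = len(funcs)
    A = np.zeros((2 ** n, 2 ** n))
    for a in product((0, 1), repeat=n):
        col = int("".join(map(str, a)), 2)
        # enumerate every realization: one function choice per node
        for choice in product(*[range(len(fs)) for fs in funcs]):
            p = 1.0
            b = []
            for i, j in enumerate(choice):
                f, c = funcs[i][j]
                p *= c
                b.append(f(a))
            A[int("".join(map(str, b)), 2), col] += p
    return A
```

Each column of the resulting matrix sums to 1, as required of a Markov chain transition matrix.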
In order to infer a PBN from gene expression time series data, we need to determine Boolean functions assigned to each node and the corresponding probabilities. For inference of Boolean functions, exhaustive search or heuristic methods are usually used. For determination of probabilities, the coefficient of determination is widely used, which is a measure of the relative decrease in error from estimating the state of vi via fij rather than by just the best constant estimate. Many studies have been done on simulation of biological systems using PBNs and on finding control strategies for biological systems. For details of inference, simulation, and control studies on PBNs, see a recent review in Ref. [53].
The Bayesian network is a statistical model, and has been developed and studied in the fields of statistics and artificial intelligence. A Bayesian network is usually defined by a directed acyclic graph and conditional probabilities.
Let G(V, E) be a directed acyclic graph with a set of nodes V and a set of directed edges E, where acyclic means that there is no directed cycle (i.e., there is no directed path beginning and ending with the same node). As in Section 2.2, pa(vi) denotes the ordered set of input nodes to vi. In the rest of this section, we use Xi to denote both the ith node (i.e., vi) and the random variable assigned to this node. In Figure 2.5(a), pa(X1) = pa(X2) = ∅ and pa(X3) = (X1, X2). In the basic version, each Xi takes discrete values. For example, we consider a case in which each Xi takes either 0 or 1. Then, logical AND (i.e., X3 = X1 ∧ X2) can be represented by the following conditional probabilities:
$$P(X_3 = 1 \mid X_1 = 1, X_2 = 1) = 1, \qquad P(X_3 = 1 \mid X_1, X_2) = 0 \ \text{otherwise}.$$
We also denote by P(Xi | pa(Xi)) the conditional probability of Xi given pa(Xi). Using conditional probabilities, the probability of the whole network is given by
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)).$$
For each of the networks given in Figure 2.5, this probability is given as follows:
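For the AND network of Figure 2.5(a), the factorization reads P(X1, X2, X3) = P(X1) P(X2) P(X3 | X1, X2). A minimal sketch follows, where the priors P(X1 = 1) = P(X2 = 1) = 0.5 are our own hypothetical choice (the text does not specify them):

```python
def joint(x1, x2, x3, p1=0.5, p2=0.5):
    """P(X1, X2, X3) = P(X1) P(X2) P(X3 | X1, X2) for the AND network of
    Figure 2.5(a). The priors p1 = P(X1 = 1) and p2 = P(X2 = 1) are
    hypothetical choices, not values from the text."""
    P1 = p1 if x1 == 1 else 1 - p1
    P2 = p2 if x2 == 1 else 1 - p2
    P3 = 1.0 if x3 == (x1 & x2) else 0.0   # deterministic AND as a conditional probability
    return P1 * P2 * P3
```

Summing `joint` over all eight assignments yields 1, confirming that the factorization defines a proper distribution.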
For inference of the Bayesian network structure G = G(V, E) from observed data D, we use the following relation (Bayes' theorem):
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}.$$
In order to calculate P(D | G), a family of probability distributions characterized by a set of parameters θ is employed. Then, P(D | G) is given by
$$P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, \mathrm{d}\theta.$$
Although computation of this score is very difficult in general, it can be given by a simple formula in some special but useful cases [7].
Then, the inference problem is to find G maximizing P(D | G) (under the assumption that the prior probability P(G) is the same for all G). However, this problem has been proved to be NP-hard [5]. Therefore, heuristic algorithms such as greedy algorithms and simulated annealing have been utilized. Nevertheless, it is reported that an optimal solution can be obtained if n is small (e.g., n ≤ 30) [38]. In the case of application of Bayesian networks to inference of genetic networks, a domain consisting of three states, {−1, 0, 1}, has been used [13], where 0, −1, and 1 mean the neutral, down-regulated, and up-regulated levels, respectively.
Various extensions and modifications of Bayesian networks have been developed and applied for inference of genetic networks. Although we have considered discrete values for variables, continuous values can also be used. For example, Imoto et al. developed a method to infer Bayesian networks with continuous variables by combining nonparametric regression and information criteria [19].
Another important variant of Bayesian networks is the dynamic Bayesian network [40]. In dynamic Bayesian networks, the random variable Xi depends on time and is represented as Xi(t). Furthermore, the state of Xi(t + 1) depends on the states of pa(Xi)(t) (the states of pa(Xi) at time t). Therefore, dynamic Bayesian networks are defined via the conditional probability
$$P(X_i(t+1) \mid \mathrm{pa}(X_i)(t)).$$
Unlike usual Bayesian networks, dynamic Bayesian networks allow loops. However, time series data are necessary to infer dynamic Bayesian networks, whereas usual Bayesian networks can be inferred from static data (e.g., data obtained by gene knockout and overexpression experiments).
If we consider the binary domain (i.e., each variable takes 0 or 1), dynamic Bayesian networks are almost the same as PBNs. This relationship is discussed in Ref. [26]. Dynamic Bayesian networks have been further extended to time-delayed Bayesian networks in which Xi(t + 1) depends not only on pa(Xi)(t) but also on pa(Xi)(t − 1), …, pa(Xi)(t − h), where h is some positive integer, and have been applied to inference of genetic networks [30].
A graphical Gaussian model is a statistical model based on conditional probabilities and Gaussian distributions. A GGM is represented as an undirected graph in which each edge represents the pairwise correlation between two nodes conditioned against the correlations with all other nodes. In the GGM approach, indirect association effects are eliminated by evaluating conditional dependencies in multivariate Gaussian distributions (see also Figure 2.6). It is to be noted that the GGM approach infers undirected networks, unlike the Boolean and Bayesian network approaches.
As discussed in Sections 2.2 and 2.3, we assume that gene expression time series data are given as vi(t) (i = 1, …, n, t = 1, …, m). However, different from the Boolean and Bayesian models, we assume that vi(t) takes real values. The standard Pearson correlation coefficient rij between vi and vj is defined as
$$r_{ij} = \frac{1}{m} \sum_{t=1}^{m} \frac{(v_i(t) - \bar{v}_i)(v_j(t) - \bar{v}_j)}{\sigma_i \sigma_j},$$
where $\bar{v}_i$ represents the mean value of vi(t) (i.e., $\bar{v}_i = \frac{1}{m}\sum_{t=1}^{m} v_i(t)$). It is to be noted that $\sigma_i$ is the standard deviation of vi(t). Let R be the correlation matrix consisting of the rijs (i.e., R = (rij)) and let $R^{-1} = (r^{ij})$ be the inverse matrix of R. Then, the partial correlation coefficient pij is defined as
$$p_{ij} = -\frac{r^{ij}}{\sqrt{r^{ii}\, r^{jj}}}.$$
Let P be the partial correlation matrix consisting of pijs (i.e., P = (pij)), where the partial correlation means the pairwise correlation between two nodes conditioned against the correlations with all other nodes. Therefore, if there is no direct correlation between nodes vi and vj, pij = 0 holds (under a linear Gaussian model). However, in practice, we cannot expect that pij = 0 holds exactly even if there is no direct correlation. Therefore, we delete edge(s) whose partial correlation is minimum or less than some threshold. Furthermore, we can apply computation of partial correlation and deletion of edges repeatedly, where R for the next step is computed from the modified P in which 0 is assigned to pijs corresponding to deleted edges.
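The partial correlation matrix can be computed directly from the inverse of the Pearson correlation matrix via p_ij = −r^{ij}/√(r^{ii} r^{jj}); a minimal sketch:

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation matrix P from a data matrix X whose rows are
    time points and whose columns are genes:
    p_ij = -r^{ij} / sqrt(r^{ii} r^{jj}), with (r^{ij}) = R^{-1}."""
    R = np.corrcoef(X, rowvar=False)       # Pearson correlation matrix
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    P = -Rinv / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P
```

For a chain v1 → v2 → v3, the marginal correlation between v1 and v3 is large, but their partial correlation is close to 0, so thresholding removes the indirect edge while keeping the direct ones.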
GGMs have been applied to inference of genetic networks by many groups [8, 25, 45, 52, 55]. Other methods for eliminating indirect edges have also been developed and applied. For example, mutual information [31] and entropy maximization [28] have been applied for that purpose.
Recently, new approaches for eliminating indirect correlations have been proposed: the silencing method [3] and the network deconvolution method [12]. The silencing method is based on a dynamical model. In this method, the global response matrix,
$$G_{ij} = \frac{\mathrm{d}v_i}{\mathrm{d}v_j},$$
and the local response matrix,
$$S_{ij} = \frac{\partial v_i}{\partial v_j},$$
are used, where we use vi to denote both a node and its expression value. G captures the change of vi's activity in response to changes of vj's activity, and can be obtained from gene knockout and overexpression experiments. It is to be noted that G includes the effects of indirect correlations, whereas S consists of only direct correlations. Therefore, G and S correspond to observed correlations and direct correlations, respectively. These two matrices are related by
$$S = \left( G - I + D\left( (G - I)\, G \right) \right) G^{-1},$$
where I is the identity matrix and D(M) denotes the matrix obtained from M by setting all off-diagonal elements to 0, which leads to a matrix-based method to compute S from G.
In the network deconvolution method, it is assumed that the true (direct) network is given in the form of an n × n matrix $G_{\mathrm{dir}}$, and the observed network has the form of
$$G_{\mathrm{obs}} = G_{\mathrm{dir}} + G_{\mathrm{dir}}^2 + G_{\mathrm{dir}}^3 + \cdots = G_{\mathrm{dir}} (I - G_{\mathrm{dir}})^{-1},$$
where I is the identity matrix. Then, $G_{\mathrm{dir}}$ can be computed from $G_{\mathrm{obs}}$ by
$$G_{\mathrm{dir}} = G_{\mathrm{obs}} (I + G_{\mathrm{obs}})^{-1}.$$
It is to be noted that
$$G_{\mathrm{obs}} - G_{\mathrm{dir}} = G_{\mathrm{dir}}^2 + G_{\mathrm{dir}}^3 + \cdots$$
corresponds to indirect correlations.
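Under the series interpretation of network deconvolution (the observed matrix is the sum of the direct matrix and all of its higher powers), the direct matrix has a closed form; a minimal sketch (the practical method of Ref. [12] additionally works in an eigenbasis with scaling, which is omitted here):

```python
import numpy as np

def deconvolve(G_obs):
    """Recover the direct matrix from
    G_obs = G_dir + G_dir^2 + ... = G_dir (I - G_dir)^{-1},
    which gives G_dir = G_obs (I + G_obs)^{-1} in closed form
    (assumes the series converges and the inverse exists)."""
    I = np.eye(G_obs.shape[0])
    return G_obs @ np.linalg.inv(I + G_obs)
```

Applying `deconvolve` to an observed matrix built from a known direct matrix via the series recovers the direct matrix exactly.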
In the above-mentioned methods, only elimination of spurious edges/correlations is considered. However, it may also be useful to modify a known network by addition and deletion of edges so that the resulting network is more consistent with observed data. This approach is called network completion (see Figure 2.7), and various methods have been proposed, although the target networks do not necessarily include genetic networks. Guimerà and Sales-Pardo developed a method that can identify both missing and spurious interactions by using stochastic block models [16]. Hanneke and Xing [17] developed a method based on random sampling of subnetworks and analyzed confidence intervals from the samples [8]. Kim and Leskovec developed the KronEM algorithm by combining the expectation–maximization (EM) method with Kronecker graphs [24]. Nakajima et al. developed a network completion method by combining dynamic programming and least-squares fitting [36].
Differential equations have been widely used in many areas in science and engineering. They have also been used for describing dynamics of various biological systems. Therefore, it is reasonable to use differential equations for inference of genetic networks, and many such studies have been done.
First, we consider inference methods based on linear differential equations. Let V = {v1, …, vn} be a set of nodes (i.e., a set of genes). Let vi(t) denote the expression level of gene vi at time t. Then, we assume that the vi(t)s (i = 1, …, n) are determined by the following differential equations:
$$\frac{\mathrm{d}v_i(t)}{\mathrm{d}t} = a_{i,0} + \sum_{j=1}^{n} a_{i,j}\, v_j(t),$$
where the $a_{i,j}$s are parameters. We relate these parameters to the network structure by the correspondence that there exists a directed edge (vj, vi) if and only if $a_{i,j} \neq 0$ ($j \geq 1$). In practice, “$a_{i,j} \neq 0$” should be replaced by “$|a_{i,j}| > \delta$” for some positive constant δ. Furthermore, since only gene expression data for discrete time steps are available, we approximate the differential equations with difference equations:
$$v_i(t + \Delta) = v_i(t) + \left( a_{i,0} + \sum_{j=1}^{n} a_{i,j}\, v_j(t) \right) \Delta,$$
where Δ is the unit time (i.e., period between two consecutive time steps). The inference problem based on these difference equations is defined as follows: given gene expression time series data vi(t) for i = 1, …, n and t = 0, Δ, 2Δ, …, (m − 1)Δ, infer parameters ai, j for i = 1, …, n, j = 0, …, n that satisfy the above difference equations.
Then, the difference equations give a system of linear equations in which the vi(t)s are constants (since they are given from the gene expression data) and the $a_{i,j}$s are variables to be determined. However, we cannot expect that this system has a unique solution because the number of equations, n(m − 1), is not necessarily equal to the number of variables, n(n + 1). In practice, the number of time points m is much smaller than the number of genes n. Furthermore, we need to consider the effect of observation and system noise. In order to cope with these problems, linear regression (more precisely, linear regression using least-squares fitting) has been applied [9, 49]. In this approach, we find a set of parameters $a_{i,j}$ that minimizes the sum of squared errors:
$$\sum_{i=1}^{n} \sum_{k=0}^{m-2} \left( v_i((k+1)\Delta) - v_i(k\Delta) - \left( a_{i,0} + \sum_{j=1}^{n} a_{i,j}\, v_j(k\Delta) \right) \Delta \right)^2.$$
This minimization problem can be solved by using matrix computations.
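The least-squares estimation reduces to one linear regression per gene with regressors (1, v1(t), …, vn(t)); a minimal sketch of the matrix computation (variable names are our own):

```python
import numpy as np

def fit_linear_model(V, dt):
    """Least-squares estimation of a_{i,0}, ..., a_{i,n} in
    v_i(t + dt) = v_i(t) + dt * (a_{i,0} + sum_j a_{i,j} v_j(t)).
    V is an m x n array; row t holds v_1(t), ..., v_n(t).
    Returns an n x (n+1) array whose row i is (a_{i,0}, a_{i,1}, ..., a_{i,n})."""
    m = V.shape[0]
    X = np.hstack([np.ones((m - 1, 1)), V[:-1]])   # regressors (1, v(t))
    Y = (V[1:] - V[:-1]) / dt                      # responses (v(t+dt) - v(t)) / dt
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A.T
```

With noise-free data generated by the difference equations themselves, the parameters are recovered up to numerical precision, provided the trajectory excites all regressors; with noisy data, the estimates minimize the residual sum of squares.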
Although the above-mentioned approach is easy to implement, it may cause overfitting owing to the imbalance between the number of data points and the number of variables. One way to cope with the overfitting problem is to introduce a regularization term into the objective function, where L1 and L2 regularization terms have been widely used [18]. For example, in the case of L1 regularization (the least absolute shrinkage and selection operator, lasso), the objective function is modified by adding the penalty term
$$\lambda \sum_{i=1}^{n} \sum_{j=0}^{n} |a_{i,j}|$$
to the sum of squared errors,
where λ is a parameter to be adjusted (usually by trial and error). It is also useful to incorporate information criteria such as Akaike's information criterion and the Bayesian information criterion [37].
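The L1-penalized objective can be minimized, for example, by iterative soft-thresholding (ISTA); the following is a minimal proximal-gradient sketch for a single gene (a generic lasso solver of our own, not a method from the cited references; in practice, packaged solvers are typically used):

```python
import numpy as np

def lasso_ista(X, y, lam, iters=5000):
    """Minimize ||X w - y||^2 + lam * ||w||_1 for one gene by iterative
    soft-thresholding (ISTA); a minimal sketch, not a tuned solver."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L, L = Lipschitz const. of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - step * 2 * X.T @ (X @ w - y)       # gradient step on the squared error
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft threshold
    return w
```

The soft-threshold step drives small coefficients exactly to zero, which is why the L1 penalty yields sparse networks.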
Next, we consider inference based on nonlinear differential equations. Of course, various types of nonlinear differential equations have been applied, and thus it is impossible to introduce all or even many of them. Therefore, we only introduce the S-system [44] because extensive studies have been done on inference of S-systems. In S-system modeling, the differential equations have the following form:
$$\frac{\mathrm{d}v_i(t)}{\mathrm{d}t} = \alpha_i \prod_{j=1}^{n} v_j(t)^{g_{i,j}} - \beta_i \prod_{j=1}^{n} v_j(t)^{h_{i,j}},$$
where αi, βi, gi,j, and hi,j (i = 1, …, n, j = 1, …, n) are parameters to be estimated from the time series data vi(t). Again, we approximate these differential equations by difference equations. Unlike the linear case, no simple and efficient method is known for determining the parameters of this model (or of most other nonlinear models). Therefore, the use of evolutionary computation methods is a good choice. Details of such methods are explained in Chapter 5 (by Kimura) and Chapter 8 (by Chowdhury and Chetty) of this book, along with other types of nonlinear differential equations.
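For completeness, the forward problem — simulating an S-system by the Euler difference approximation mentioned above — can be sketched as follows (the parameter values used in any experiment are hypothetical):

```python
import numpy as np

def simulate_s_system(alpha, beta, G, H, v0, dt, steps):
    """Euler (difference-equation) simulation of an S-system:
    dv_i/dt = alpha_i * prod_j v_j^{g_ij} - beta_i * prod_j v_j^{h_ij}.
    Returns a (steps+1) x n array of expression levels."""
    V = [np.asarray(v0, dtype=float)]
    for _ in range(steps):
        v = V[-1]
        production = alpha * np.prod(v ** G, axis=1)
        degradation = beta * np.prod(v ** H, axis=1)
        V.append(v + dt * (production - degradation))
    return np.array(V)
```

Inference methods repeatedly run such simulations for candidate parameter sets and score them against the observed time series, which is the costly inner loop that evolutionary methods must handle.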
In addition to differential equation-based modeling, inference methods based on mass-action modeling have also been proposed. In this approach, each biochemical reaction R is represented in the following form:
$$a_1 S_1 + a_2 S_2 + \cdots + a_N S_N \longrightarrow b_1 S_1 + b_2 S_2 + \cdots + b_N S_N,$$
where {S1, …, SN} denotes a set of different molecular species occurring in R, and the ai, bi are non-negative integer parameters called the stoichiometric coefficients. Nobile et al. developed a method for inference of genetic networks based on mass-action modeling by combining Cartesian genetic programming and particle swarm optimization. They also discuss an extension of the method to support stochastic simulation algorithms such as the Gillespie algorithm. For details of mass-action modeling based approaches, see Chapter 6 (by Nobile et al.).
We have seen various models and methods for inference of genetic networks. All the reviewed models and methods assume that the topology of the network does not change through time. However, the real gene regulatory network in a cell might dynamically change its structure depending on the effects of certain shocks, cell division, and so on. Therefore, it is also important to develop mathematical models and methods for inferring time-varying genetic networks based on time series gene expression data (see Figure 2.8). In this problem, it is required to infer both change time points and network structures between two consecutive change time points (including the starting and end time points).
For this problem, various models and methods have been proposed. For example, Yoshida et al. developed a dynamic linear model with Markov switching that represents change points in regimes, which evolve according to a first-order Markov process [54]. Robinson and Hartemink proposed a nonstationary dynamic Bayesian network, based on dynamic Bayesian networks [42]. Fujita et al. proposed a method based on the dynamic autoregressive model by extending the vector autoregression model [14]. Lèbre et al. developed another autoregressive time-varying algorithm by adopting a combination of reversible jump Markov chain Monte Carlo and dynamic Bayesian networks [27]. Thorne and Stumpf developed a method based on modeling of the regulatory network structure between distinct segments with a set of hidden states by applying the hierarchical Dirichlet process and hidden Markov model [51]. Rasool and Bouaynaya developed a method based on constrained and smoothed Kalman filtering [41]. Nakajima and Akutsu extended a network completion method based on dynamic programming and least-squares fitting [36] to time-varying networks [35].
In this chapter, we have briefly reviewed mathematical models and computational methods for inference of genetic networks from gene expression time series data. Since so many methods have been proposed, we could review only basic and simple models and methods. Although we have focused on inference from gene expression data only, integration of various types of data can also be useful for inference of genetic networks, and various such methods have been proposed [18] (see also Chapter 3 (by Zhang)).
It is quite difficult to assess which inference method is the best (or better than others). In order to assess inference methods, we need to know the true genetic networks or gold standard networks. However, it is quite difficult to determine true genetic networks by biological experiments because genes do not regulate each other directly; rather, regulation occurs in a complex way via mRNAs, proteins, and other factors (e.g., epigenetic factors). This situation is very different from that of metabolic networks and protein–protein interaction networks, although it is still difficult to determine true protein–protein interactions by biological experiments.
For protein structure prediction, the Critical Assessment of protein Structure Prediction (CASP) has been playing an important role in the assessment and development of prediction methods [34]. The success of CASP may be due to the fact that accurate structures can be determined by X-ray and NMR experiments. In order to assess inference and analysis methods for gene expression data, DREAM (Dialogue for Reverse Engineering Assessments and Methods) [33] and CAMDA (Critical Assessment of Techniques for Microarray Data Analysis) [50] have been organized, where CAMDA has since changed its name and scope to Critical Assessment of Massive Data Analysis. Unlike in CASP, however, it is difficult to provide benchmark data and to assess the results. Therefore, these projects provide various types of data, including real gene expression data and simulated gene expression data. It is to be noted that gold standards are available for assessing the prediction results when simulated gene expression data are used. For example, in a recent challenge [34], two subchallenges were proposed: one for parameter estimation given the topology and biochemical structure of a 9-gene regulatory network, and the other for prediction of three missing links in an incomplete topology with 11 genes. An interesting point of this recent challenge is that participants were allowed to (virtually) buy several kinds, but only limited amounts, of experimental data generated in silico. The benchmark data provided by these projects are becoming useful for developing and evaluating new inference and analysis methods.