Inferring the Topology of Gene Regulatory Networks: An Algebraic Approach to Reverse Engineering
Brandilyn Stigler* and Elena Dimitrova†, *Southern Methodist University, Dallas, TX 75275-0156, USA, †Clemson University, Clemson, SC 29634-0975, USA, [email protected]
In everyday language the word “genes” has become synonymous with heredity. Nowadays children learn early that who we are, and what the living world around us is, is encoded in each and every cell of an organism. This, of course, is generally true.1 Indeed, genes hold the information to build and maintain an organism’s cells and to pass genetic traits to offspring. However, thinking of a genome simply as a book full of facts about an organism paints a rather incomplete, static picture. It hides the fact that the “book” also contains instructions for the mechanisms through which genetic information is extracted and plays a role in cellular processes, such as controlling the response of a cell to environmental signals and replication of the DNA preceding cell division. This process of extracting and utilizing genetic information is part of what is known as gene regulation. Gene regulation is essential for prokaryotic and eukaryotic cells, as well as for viruses, as it increases the cell’s flexibility in responding to the environment, allowing it to synthesize gene products (most often proteins) when needed.
Gene regulation is an intricate process whose complexity makes it an extreme challenge for mathematical modeling. For example, proteins synthesized from genes may control the flow of genetic information from DNA to mRNA, function as enzymes catalyzing metabolic reactions, or serve as components of signal transduction pathways. Gene regulation also drives the processes of cellular differentiation and morphogenesis, leading to the creation of different cell types in multicellular organisms, where the different types of cells may possess different gene expression profiles though they all possess the same genome sequence. The degradation of proteins and intermediate RNA products can also be regulated in the cell. The proteins involved in the regulatory functions are themselves produced by other genes. This gives rise to a gene regulatory network (GRN) consisting of regulatory interactions between DNA, RNA, proteins, and small molecules. A simple GRN consists of one or more input genes, metabolic and signaling pathways, regulatory proteins that integrate the input signals, several target genes, and the RNA and proteins produced from the target genes.
The first discovery of a GRN is widely considered to be the identification in 1961 of the lac operon by François Jacob and Jacques Monod [1], in which proteins involved in lactose metabolism are expressed by Escherichia coli and some other enteric bacteria only in the presence of lactose and absence of glucose. Since its discovery, the lac operon has often been used as a model system of gene regulation. Most mathematical models of the lac operon have been given as systems of differential equations (see Chapter 2 for examples), but discrete modeling frameworks are increasingly receiving attention for their use in offering global insights [2–4]. For examples of discrete mathematical models of a GRN, see the Boolean network models of the lac operon in E. coli in Chapter 1.
Boolean networks are particularly useful in the case where one is interested in qualitative behavior. However, allowing each variable to assume only 0 and 1 as its values (often interpreted as ‘absent’ and ‘present’, or ‘off’ and ‘on’) does not always allow for sufficient flexibility in the model construction and can cause the loss of important information. Furthermore, few GRNs are as well understood as the lac operon, which prohibits model construction that is purely based on knowledge of the GRN. Technological advances in the life sciences, however, have triggered an enormous accumulation of experimental data representing the activities of the living cell. As a result, the type of mathematical modeling which starts from experimental data and builds a model with little additional information has become of interest. Such modeling methods are sometimes referred to as reverse engineering and are the topic of the following section.
Reverse engineering is an approach to mathematical modeling in which a model is constructed from observations of a system in response to stimulus. The process of reverse engineering is similar to playing the parlor game Twenty Questions, in which one tries to discover a secret word by asking yes/no questions. While one can ask all possible feature-type questions (is it alive, is it a physical object), one can typically get away with far fewer questions. In fact, one wins if fewer than 20 questions are asked. In molecular biology, a biological network, such as our immune system’s response to infection, plays the role of the secret word and laboratory experiments take the place of the questions. The goal then is to discover the network through the experiments with the hope of gaining deeper insight into a particular phenomenon, such as seasonal effects on immunity. For reasons of cost, it is advantageous to be able to discover the network with as few experiments as possible.
Reverse engineering is a critical step in the systems biology paradigm that has pervaded the biological sciences in recent history. The modeling process starts with a biological system under study. Given a question or hypothesis about the system, the researcher designs experiments to probe the system in hopes of addressing the question or hypothesis. A model is then built from the data resulting from the experiments. Predictions or simulations are extracted from the model and then compared against the original system. The step “from data to a model” is one that uses reverse engineering. This paradigm differs from the previous “reductionist” paradigm that dominated twentieth century biology: the parts of a system were of importance for study in reductionism, whereas the system itself is of interest in systems biology.
Generally speaking, reverse engineering in systems biology aims to recover the network topology and dynamics of a network from observations. A model is built to fit the given observations, and the topology and/or the dynamics of the network can be inferred from the model. Network topology refers to the physical structure of the network, that is, how the components in the network are connected. It is often encoded as a directed graph, or a wiring diagram, where vertices represent the components of the network (genes, proteins, signaling molecules, etc.) and a directed edge is drawn between two vertices if there is an interaction between the associated components (regulation, synthesis, activation, etc.). Dynamics refers to the behavior of the network over time, that is, the time evolution of the network processes. The dynamics of a network is also depicted as a graph; in the case of finite dynamical systems, which are considered here, the graph is finite and directed. The so-called state space graph is composed of vertices representing network states ($n$-tuples), and a directed edge is drawn between two vertices A and B if the network advances from state A to state B. See Section 3.2 for definitions.
Gene regulatory networks are often modeled using collections of mathematical functions in which each molecular component is assigned a function that formalizes the dynamics of the component [5,6]. From these functions the wiring diagram and state space can be constructed and the structure and behavior of the GRN can be analyzed. A challenge for molecular geneticists is to identify causal links in the network. Identification and control of these links are important first steps in repairing defects in regulation. While there is a growing amount of data being collected from such networks, control of GRNs requires knowledge of the topology and dynamics of the GRN.
Mathematical methods to reverse engineer GRNs are diverse and draw from statistics, graph theory, network theory, computational algebra, and dynamical systems [7,8]. The performance of these methods intrinsically depends on the amount and quality of data provided [9]. In practice, there is insufficient data to uniquely infer a model for a GRN and the number of models that fit the data may be considerably large. An area of continual growth is the development of methods to select biologically feasible or likely models from a pool of candidate models. There are numerous strategies for model selection. For example, some methods restrict the space of likely wiring diagrams to those that have few inputs per vertex or whose in-degree distribution follows a power law, features which are consistent with what is believed about GRNs [10,11]. As it has also been observed that GRNs tend to be ordered systems with a limited number of regimes, there are methods which filter their results to only those state spaces with few steady states and short limit cycles (see Section 3.2 for definitions). Other aspects of GRNs, such as the negative or positive feedback loops and oscillatory behavior, are also considered in model selection.
Different mathematical frameworks have been proposed for the modeling of GRNs, with continuous models using differential equations being the most common approach. While the behavior of a biological system may be seen as continuous in that change of concentration of biochemicals can be modeled with continuous functions, the technology to record observations of the system is not continuous and the available experimental data consist of collections of discrete instances of continuous processes. Furthermore, discrete models where a variable (e.g., a gene) could be in one of a finite number of states are more intuitive, phenomenological descriptions of GRNs and, at the same time, require less data to build. The framework of finite dynamical systems (FDS) provides the opportunity to model a GRN as a collection of variables that transition discretely from one state to the next.
For each variable $x_i$, its local function $f_i$ determines the state of $x_i$ in the next iteration based on the current state of all variables, including possibly $x_i$ itself. If we further require that the state set $S$ be a finite field, then a result in [12] guarantees that the local functions of an FDS can be expressed as polynomial functions. Working exclusively with polynomials facilitates the modeling process significantly, as we shall see in later sections. Fortunately, from basic abstract algebra we know that $S$ can be made into a finite field simply by requiring that the cardinality of $S$ be a power of a prime integer. An example of a finite field is the set of integers modulo $p$, denoted $\mathbb{Z}_p$, where $p$ is prime. The elements of $\mathbb{Z}_p$ are the equivalence classes $[0], [1], \ldots, [p-1]$ of remainders upon division by $p$, and addition and multiplication of these classes correspond to addition and multiplication of representatives from the classes modulo $p$.
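As a quick illustration of why primality matters, the following Python sketch (ours, not from the chapter) checks that every nonzero element of $\mathbb{Z}_5$ has a multiplicative inverse, while $\mathbb{Z}_6$ fails this test:

```python
# Arithmetic in the finite field Z_p for prime p; a hypothetical helper.
p = 5

def inv(a, p):
    """Multiplicative inverse of a modulo prime p, via Fermat's little
    theorem: a^(p-1) = 1, so a^(p-2) is the inverse of a."""
    return pow(a, p - 2, p)

# Every nonzero element of Z_5 has an inverse, so Z_5 is a field.
inverses = {a: inv(a, p) for a in range(1, p)}
assert all((a * b) % p == 1 for a, b in inverses.items())

# By contrast, Z_6 is not a field: 2 has no multiplicative inverse mod 6.
assert all((2 * b) % 6 != 1 for b in range(6))
```

The same check fails for any composite modulus, which is exactly why the cardinality of the state set is required to be a prime (or prime power, with a different construction).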
For simplicity, we will write $m$ instead of $[m]$ to represent the elements of a finite field.
If the state set for an FDS $F$ is a finite field $k$, we call $F$ a polynomial dynamical system (PDS).
A key issue in the study of GRNs is understanding the relationships among genes, proteins, metabolites, etc. in the network. Thus an important characteristic of a PDS is the way its variables are “connected.” To visualize the network topology, a graph called a wiring diagram is used, where the vertices are labeled by the PDS variables and the directed edges signify the direction of interaction between two variables.
Figure 3.1(a) shows the wiring diagram of the PDS from Example 3.2.
Figure 3.1 Wiring diagram (a) and state space graph (b) of the PDS from Example 3.2.
While the wiring diagram of a PDS provides only a static snapshot of the network topology with no reference to strength and timing of the interactions, such a graph carries very useful information about the biological system and several methods have been developed for its inference. Sections 3.5.1 and 3.5.2 present two such methods.
Just as a phase portrait is used to plot the simultaneous change of two or three variables in a system of differential equations [13], the dynamic properties of a PDS are captured in a graph called a state space graph. In fact, the dynamics of a PDS is uniquely represented by its state space graph, which has $|k|^n$ vertices, where $n$ is the number of variables.
The state space graph of the PDS from Example 3.2 is shown in Figure 3.1(b). For example, an arrow is drawn from each state to its image state, computed using modulo 3 arithmetic (since the field is $\mathbb{Z}_3$). The graph consists of vertices arranged into three connected components. Notice that two of the states are mapped to themselves. Such points are called fixed points, and these two are the only fixed points for this PDS. Also notice that two states map to each other and back. This is called a limit cycle, and its length is two. Analogously, for any positive integer $m$, a limit cycle of length $m$ arises whenever $F^m(s) = s$ for some state $s$ (and $m$ is the smallest positive integer with this property). Fixed points can be considered as limit cycles of length one. For this PDS, the two-cycle is the only limit cycle of length larger than one.
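Since the PDS of Example 3.2 is not reproduced here, the following Python sketch builds the state space graph of a small hypothetical PDS over $\mathbb{Z}_3$ and locates its fixed points and limit cycles by brute force; the local functions $f_1 = x_2$ and $f_2 = x_1 x_2$ are our own illustrative choice:

```python
# Enumerating the state space of a hypothetical 2-variable PDS over Z_3.
from itertools import product

p = 3  # the field Z_3

def F(s):
    """F = (f1, f2) with f1 = x2 and f2 = x1*x2 (illustrative, not Example 3.2)."""
    return (s[1] % p, (s[0] * s[1]) % p)

# State space graph: one directed edge s -> F(s) for each of the p^2 states.
states = list(product(range(p), repeat=2))
edges = {s: F(s) for s in states}

# Fixed points satisfy F(s) = s (limit cycles of length one).
fixed_points = [s for s in states if edges[s] == s]

def cycle_of(s):
    """Iterate F until a state repeats; return the periodic part of the trajectory."""
    seen = []
    while s not in seen:
        seen.append(s)
        s = edges[s]
    return tuple(seen[seen.index(s):])

# Every trajectory ends in a fixed point or a limit cycle.
cycles = {frozenset(cycle_of(s)) for s in states}
```

For this toy system the enumeration finds the fixed points $(0,0)$ and $(1,1)$ and one limit cycle of length three through $(2,2)$, $(2,1)$, and $(1,2)$.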
In this section we introduce some basic concepts from computational algebra that we will need for the rest of the chapter. For a comprehensive introductory treatment of the subject some excellent sources are [19–21].
In linear algebra we learn how to solve systems of linear equations using Gaussian elimination. However, the algorithm cannot be easily adapted to systems of polynomials of degree larger than one. Solving systems of polynomial equations simultaneously is needed in many areas, but a systematic computational method was not developed until the 1960s, when so-called Gröbner bases were introduced [22,23]. The role of a Gröbner basis is similar to that of a basis for a vector space, and the algorithm for finding it can be seen as a multivariate, nonlinear generalization of Gaussian elimination for linear systems.
An issue that arises when using multivariable polynomials is that polynomial division is generally not well defined. Polynomial division in one variable is always well defined in the sense that there is a unique remainder upon division of polynomials. This is because there is a unique monomial ordering (see Definition 3.4) and the division of polynomial f by polynomial g proceeds by dividing the highest power of the variable in g into the highest power in f. In other words, the one-variable monomials are ordered by degree: $1 < x < x^2 < x^3 < \cdots$.
With multivariate polynomials, however, there is more than one way of ordering their monomials (terms) and thus to carry out long division. So the remainder of long division is not uniquely determined in general.
As we will see below, multivariate polynomial division becomes well defined when using Gröbner bases.
Let $k$ be a field. Here we focus on the finite fields $\mathbb{Z}_p$, where $p$ is prime; however, many of the presented results are valid for other fields as well. Linear combinations over $k$ of monomials of the form $x^\alpha = x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, where $\alpha = (\alpha_1, \ldots, \alpha_n)$ is an $n$-tuple exponent with nonnegative entries, form a ring, called a polynomial ring and denoted $k[x_1, \ldots, x_n]$. For the purpose of polynomial division, it is necessary to be able to arrange the terms of a polynomial unambiguously in some order. This requires a total ordering on the monomials, i.e., for every pair of monomials $x^\alpha$ and $x^\beta$, exactly one of the following must be true: $x^\alpha < x^\beta$, $x^\alpha = x^\beta$, or $x^\alpha > x^\beta$.
Taking into account the properties of the polynomial sum and product operations, the following definition emerges.
To simplify the classification of monomial orderings, we will assume throughout that $x_1 > x_2 > \cdots > x_n$ (or $x > y > z$).
One of the most popular monomial orderings is the lexicographic ordering (lex), which is analogous to the ordering of words in dictionaries; under lex, for instance, $xy^2 > y^3z^4$. Another one is the graded lexicographic ordering (grlex), which orders by total degree first, then “breaks ties” using the lexicographic ordering; under grlex, $y^3z^4 > xy^2$. Yet another very useful monomial ordering is the graded reverse lexicographic ordering (grevlex). Like grlex, it uses the total degree first, but to break ties it looks at the rightmost variable and favors the smaller power. For example, $x^2yz^2 > xy^3z$ under grlex but $x^2yz^2 < xy^3z$ under grevlex. Algorithmic definitions are available in [19].
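These three orderings can be compared mechanically on exponent tuples. The following Python sketch (our own, with the variables ordered $x > y > z$) encodes each ordering as a sort key and checks the examples above:

```python
# Sort keys for lex, grlex, and grevlex on exponent tuples (a_x, a_y, a_z).

def lex_key(alpha):
    return alpha  # compare the exponent of x first, then y, then z

def grlex_key(alpha):
    return (sum(alpha), alpha)  # total degree first, lex breaks ties

def grevlex_key(alpha):
    # Total degree first; ties broken by the SMALLER power of the
    # rightmost variable winning, i.e., the reversed tuple, negated.
    return (sum(alpha), tuple(-a for a in reversed(alpha)))

xy2  = (1, 2, 0)   # x*y^2
y3z4 = (0, 3, 4)   # y^3*z^4
a = (2, 1, 2)      # x^2*y*z^2
b = (1, 3, 1)      # x*y^3*z

assert lex_key(xy2) > lex_key(y3z4)        # lex: x*y^2 > y^3*z^4
assert grlex_key(y3z4) > grlex_key(xy2)    # grlex: degree 7 beats degree 3
assert grlex_key(a) > grlex_key(b)         # same degree, lex tie-break
assert grevlex_key(b) > grevlex_key(a)     # grevlex reverses this pair
```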
A monomial ordering can also be defined by a weight vector $w$ in $\mathbb{R}^n$. We require that $w$ have nonnegative coordinates in order for 1 to always be the smallest monomial. Fix a monomial ordering $>_\sigma$, such as lex or grevlex. Then, for $\alpha, \beta \in \mathbb{Z}_{\geq 0}^n$, define $\alpha >_{w,\sigma} \beta$ if and only if $w \cdot \alpha > w \cdot \beta$, or $w \cdot \alpha = w \cdot \beta$ and $\alpha >_\sigma \beta$.
For example, in two variables with $x > y$ and ties broken by lex, the weight vector $w = (1, 0)$ orders monomials in the same way as the lexicographic order, while $w = (1, 1)$ performs the same monomial ordering as the graded lexicographic ordering.
However, even after fixing the monomial ordering, there is still a problem with multivariate polynomial division: when dividing a given polynomial by more than one polynomial, the outcome may depend on the order in which the division is carried out. Let $f, f_1, \ldots, f_s \in k[x_1, \ldots, x_n]$. The famous ideal membership problem is to determine whether there are polynomials $h_1, \ldots, h_s$ such that $f = h_1 f_1 + \cdots + h_s f_s$. The following definition will help us state this in the language of abstract algebra and explains the name of the problem.
We know from abstract algebra that $I$ is closed under addition and under multiplication by any polynomial in $k[x_1, \ldots, x_n]$. This is analogous to a subspace of a vector space, where the “vectors” are polynomials.
We ask whether $f$ is an element of $I$. In general, even under a fixed monomial ordering, the order in which $f$ is divided by the generating polynomials affects the remainder; perhaps even more surprisingly, a nonzero remainder of division does not imply $f \notin I$. Moreover, the generating set of the ideal $I$ is not unique, but a special generating set $G$ can be selected so that the remainder of polynomial division of $f$ by the polynomials in $G$, performed in any order, is zero if and only if $f$ lies in $I$. A generating set with this property is called a Gröbner basis. A Gröbner basis exists for every ideal other than the zero ideal, for any fixed monomial ordering. Before we give a precise definition, we need to look at a special kind of ideal.
A Gröbner basis $G$ for an ideal is necessarily a generating set for the ideal but may not be unique even under a fixed monomial ordering. However, if we also require that for any two distinct elements $g, g' \in G$, no term of $g'$ be divisible by the leading term of $g$, such a Gröbner basis is called reduced and is unique for a given ideal and monomial ordering, provided the coefficient of the leading term of $g$ is 1 for each $g \in G$.
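The membership test can be carried out with any computer algebra system; the following sketch uses SymPy's `groebner` routine on an illustrative ideal of our own choosing (not one from the chapter):

```python
# Ideal membership via a Groebner basis, using sympy.
from sympy import symbols, groebner

x, y = symbols('x y')

# Generators of an illustrative ideal I = <x*y - 1, y^2 - x> in Q[x, y].
gens = [x*y - 1, y**2 - x]

# Compute a Groebner basis for I under lex with x > y.
G = groebner(gens, x, y, order='lex')

# Division by a Groebner basis yields a unique remainder: it is zero
# exactly when the dividend lies in the ideal.
_, r1 = G.reduce(x*y**2 - y)   # x*y^2 - y = y*(x*y - 1), so it is in I
_, r2 = G.reduce(x + y)        # x + y is not in I (it is nonzero on the variety)
```

Here `r1` comes out zero while `r2` does not, settling membership without any search over orders of division.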
In this section, we consider the problem of constructing PDSs from a given set of data collected from a biological system on n components, called nodes.
Let $k$ be a finite field. Let $S = \{\mathbf{s}_1, \ldots, \mathbf{s}_m\}$ be a set of $m$ input states of an $n$-node system, and $T = \{\mathbf{t}_1, \ldots, \mathbf{t}_m\}$ a set of corresponding output states, where $\mathbf{s}_i, \mathbf{t}_i \in k^n$. Each of $\mathbf{s}_i$ and $\mathbf{t}_i$ has the form $\mathbf{s}_i = (s_{i1}, \ldots, s_{in})$ and $\mathbf{t}_i = (t_{i1}, \ldots, t_{in})$,
where states in bold are elements of $k^n$ and italicized states are elements of $k$. We aim to find all PDSs $F = (f_1, \ldots, f_n)$ over $k$ such that $F(\mathbf{s}_i) = \mathbf{t}_i$ for each $i = 1, \ldots, m$, where each $f_j \in k[x_1, \ldots, x_n]$. We proceed by considering the subproblem of finding, for a fixed node $j$, all transition polynomials $f_j$ that satisfy $f_j(\mathbf{s}_i) = t_{ij}$ for $i = 1, \ldots, m$.
Let $p$ be the (prime) characteristic of $k$. The identity $a^p = a$ for each $a \in k$ imposes a degree restriction on polynomials when viewed as functions, namely that the degree in each variable may be taken to be less than $p$. This identity gives rise to the relation $x_j^p - x_j = 0$ for each $j = 1, \ldots, n$. Since we are interested in polynomials that can be treated as functions, the local functions are actually polynomials in the quotient ring $k[x_1, \ldots, x_n]/\langle x_1^p - x_1, \ldots, x_n^p - x_n \rangle$. This ring contains polynomials that are the remainders of the elements of $k[x_1, \ldots, x_n]$ upon division by the degree relations, that is, polynomials whose degree in any variable is less than $p$. In essence, the quotient ring is constructed by “modding out” by the elements of the “divisor” ideal. Thus we aim to find all polynomials $f_j$ that map each system state $\mathbf{s}_i$ to the node state $t_{ij}$.
Since the remaining discussion assumes that the coefficient field has positive characteristic, we can simplify the notation. We will denote the quotient polynomial ring simply as $k[x_1, \ldots, x_n]$ and write elements of the ring as polynomials in the usual representation. That is, while an element of the quotient ring is of the form $f + \langle x_1^p - x_1, \ldots, x_n^p - x_n \rangle$, we will instead write it as $f$, understanding that all ring elements have been reduced modulo the ideal $\langle x_1^p - x_1, \ldots, x_n^p - x_n \rangle$.
Let $F_j$ be the set of all polynomials that fit the data for node $j$ as described above. We solve the problem similarly to solving a system of nonhomogeneous linear equations, in that we construct a “particular” polynomial and the set of “homogeneous” polynomials. The “particular” polynomial is any $f_j$ that interpolates, or fits, the given data. There are numerous methods for constructing an interpolating polynomial function. Here we use a formula based on the ring-theoretic version of the Chinese Remainder Theorem: $f_j = \sum_{i=1}^m t_{ij}\, r_i$,
where $r_i$ is a polynomial that evaluates to 1 on $\mathbf{s}_i$ and 0 on any other input state, and $t_{ij}$ represents the $j$th coordinate of $\mathbf{t}_i$. Specifically, $r_i = \prod_{l \neq i} \dfrac{x_c - s_{lc}}{s_{ic} - s_{lc}}$,
where $c = c(i, l)$ is the first coordinate in which $\mathbf{s}_i$ and $\mathbf{s}_l$ differ. For a detailed description of the construction, see [25]. For the set of “homogeneous” polynomials, we use the Ideal-Variety Correspondence from algebraic geometry [19]. We view the input data as a variety, or a set of roots of a system of polynomial equations. The Ideal-Variety Correspondence states how to construct the ideal of polynomials that vanish on the data. To each input state $\mathbf{s}_i$, for $i = 1, \ldots, m$, we associate the (maximal) ideal $I_i = \langle x_1 - s_{i1}, \ldots, x_n - s_{in} \rangle$.
Each ideal $I_i$ is the set of polynomials that evaluate to 0 on the single data point $\mathbf{s}_i$. According to the correspondence, the intersection of these ideals contains all polynomials that vanish on the union of the input data. So the ideal of “homogeneous”, or vanishing, polynomials is $I_1 \cap I_2 \cap \cdots \cap I_m$,
which we denote by $I$ for simplicity. Therefore, the set $F_j$ is described by the coset $f_j + I = \{f_j + h : h \in I\}$.
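The construction of the “particular” polynomial can be sketched numerically. The following Python code (with made-up input states and outputs over $\mathbb{Z}_3$, not the chapter's data) evaluates the indicator polynomials $r_i$ as functions and checks that $f_j = \sum_i t_{ij} r_i$ interpolates:

```python
# Interpolation over Z_3 via indicator polynomials; illustrative data only.
p = 3
S = [(0, 1), (1, 1), (2, 0)]   # hypothetical input states for a 2-node system
t = [1, 2, 0]                  # hypothetical outputs of one node j

def indicator(i):
    """The function r_i: evaluates to 1 on S[i] and 0 on every other input,
    built from factors (x_c - s_lc)/(s_ic - s_lc), where c is the first
    coordinate in which S[i] and S[l] differ."""
    def r(x):
        val = 1
        for l, s_l in enumerate(S):
            if l == i:
                continue
            c = next(k for k in range(len(x)) if S[i][k] != s_l[k])
            # pow(a, p-2, p) is the inverse of a modulo the prime p
            val = val * (x[c] - s_l[c]) * pow(S[i][c] - s_l[c], p - 2, p) % p
        return val % p
    return r

def f(x):
    """The 'particular' polynomial f_j = sum_i t_ij * r_i, as a function."""
    return sum(t[i] * indicator(i)(x) for i in range(len(S))) % p

# f fits the data; every other solution is f plus an element of the
# vanishing ideal I, i.e., a polynomial that is zero on every input state.
assert all(f(S[i]) == t[i] for i in range(len(S)))
```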
The data set in the previous example is called a time series, since the data points are considered to be successive. We call the set of all PDSs that fit the given data the model space associated to the data, and any such PDS is called a model. Note that the ideal $I$ of vanishing polynomials is independent of the node or particular function.
For the following exercises, consider the time series over $\mathbb{Z}_3$ below, which contains five observations for a 3-node system.
x1 | x2 | x3
1 | 1 | 1
2 | 0 | 1
2 | 0 | 0
0 | 2 | 2
0 | 2 | 2
As we saw in the previous section, the model space consists of all PDSs that fit a given data set. The problem is now to select models from the space.
There are numerous strategies for model selection. One naive approach is to simply choose the interpolating polynomial constructed above. The problem, however, is that predictions of the model are dependent on the form of the model. Specifically, the terms that appear in the polynomials will “predict” the interactions among the nodes as represented in the wiring diagram associated to the PDS. While it is possible that the chosen interpolation formula well characterizes the physical setting being modeled, it is more likely that the resulting model will have little to no predictive power. Therefore, strategies that incorporate appropriate priors are desirable (for example, see [27,28]).
It is believed that GRNs are sparse, in the sense that they have few (less than half of all possible) connections, and that their edges follow a power-law distribution [10,11]. A model selection strategy that makes use of these assumptions is one based on Gröbner bases. Given the set $f_j + I$ of transition functions for a node, the method finds the polynomial in the set none of whose terms is divisible by a leading term of $I$. This is done by computing the remainder of $f_j$ upon division by a Gröbner basis for $I$. As seen in Section 3.3, polynomial division in general is not well defined in the sense that remainders may not be unique, but it is well defined given a Gröbner basis. The strategy therefore requires that a Gröbner basis $G$ for $I$ be computed. The process of computing the remainder of $f_j$ upon division by the elements of $G$ results in a unique polynomial, called the normal form of $f_j$ with respect to $G$ and denoted $\mathrm{NF}(f_j, G)$. Given a monomial ordering and a Gröbner basis $G$ for $I$ with respect to it, $(\mathrm{NF}(f_1, G), \ldots, \mathrm{NF}(f_n, G))$ is a model that fits the given data and is reduced with respect to the data.
Since Gröbner bases depend on the choice of monomial ordering, so do resulting models.
To see the explicit Gröbner basis that is being used, you can type

gens gb I
which returns the generators of the Gröbner basis as a matrix. To get the generators in list form, type

flatten entries gens gb I
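In place of Macaulay2, the normal-form computation can also be sketched in SymPy. The data below are made up for illustration; the vanishing ideal is assembled as a product of the maximal ideals of the points (for distinct points these ideals are pairwise comaximal, so their intersection equals their product), together with the field relations:

```python
# Normal-form model selection over Z_3 with sympy; illustrative data only.
from itertools import product
from sympy import symbols, groebner

x1, x2 = symbols('x1 x2')
p = 3

# Hypothetical input states for a 2-node system.
S = [(0, 1), (1, 1), (2, 0)]

# Vanishing ideal I: product of the maximal ideals <x1 - a, x2 - b>,
# generated by all products of one generator per point, plus x^p - x.
point_gens = [[x1 - a, x2 - b] for (a, b) in S]
I_gens = [g1 * g2 * g3 for g1, g2, g3 in product(*point_gens)]
I_gens += [x1**p - x1, x2**p - x2]

G = groebner(I_gens, x1, x2, order='lex', modulus=p)

# A deliberately messy polynomial fitting the outputs t = (1, 2, 0):
# x1*x2*(x1 - 1) vanishes on every input state, so f and x1 + 1 fit equally.
f = x1 + 1 + x1*x2*(x1 - 1)

_, nf = G.reduce(f)   # the normal form: the reduced model for this node
```

Reducing the messy representative strips away everything in $I$, leaving a sparse polynomial in the cheapest terms the lex ordering allows.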
We saw in the previous section that the choice of monomial ordering affects which model is selected from the model space. One way to minimize the effect of this choice is to minimize the number of variables in the ambient ring. We again restrict our attention to finding functions for a node $j$. Consider the data set as above and let $F_j = f_j + I$ be the set of polynomials that fit the data. Next we describe the Minimal Sets Algorithm (MSA), which finds the smallest sets of variables such that a polynomial in $F_j$ involving only those variables exists (see [18] for more details).
Minimal Sets Algorithm:
1. Input the input states $\mathbf{s}_1, \ldots, \mathbf{s}_m$ and the corresponding output values $t_{1j}, \ldots, t_{mj}$ for node $j$.
2. Partition the input states according to the output values: for each output value $a \in k$, let $P_a = \{\mathbf{s}_i : t_{ij} = a\}$. Then the partition is given by $\{P_a : a \in k\}$.
3. Compute the square-free monomial ideal $M$ generated by the monomials $\prod_{c \,:\, s_{ic} \neq s_{lc}} x_c$ for all $\mathbf{s}_i \in P_a$ and $\mathbf{s}_l \in P_b$ with $a \neq b$. These monomials encode the coordinates in which the input states $\mathbf{s}_i$ and $\mathbf{s}_l$ differ.
4. Compute the primary decomposition of M; that is, write M as the irredundant intersection of primary ideals.
5. Return the generating sets of the minimal prime ideals associated to $M$.
The generating sets of these minimal prime ideals are the sought-after minimal sets (see Corollary 4 of [18]).
Essentially, what the MSA does is compute the intersection of all possible “minimal” wiring diagrams. In the above example, each minimal set for the node consists of a single variable, which indicates that ALL minimally described functions that fit the data for that node will be in terms of just one variable.
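For square-free monomial ideals, the generating sets of the minimal primes are exactly the minimal sets of variables that “hit” the support of every generator, so small instances of the MSA can be sketched by brute force. The data below are our own, not the chapter's example:

```python
# Brute-force sketch of the Minimal Sets Algorithm; illustrative data only.
from itertools import combinations

# Input states and one node's outputs over Z_3 (indices 0,1,2 = x1,x2,x3).
S = [(0, 1, 2), (1, 1, 2), (0, 0, 2)]
t = [0, 1, 1]

# Supports of the generators of M: the coordinates in which two input
# states with DIFFERENT outputs differ.
supports = [
    frozenset(c for c in range(len(S[i])) if S[i][c] != S[l][c])
    for i in range(len(S)) for l in range(i + 1, len(S))
    if t[i] != t[l]
]

def minimal_sets(supports):
    """All inclusion-minimal sets of variables meeting every support:
    these generate the minimal primes of the monomial ideal M."""
    n = max((max(s) for s in supports), default=-1) + 1
    found = []
    for size in range(n + 1):  # small sets first, so minimality is automatic
        for cand in combinations(range(n), size):
            cset = set(cand)
            if all(cset & s for s in supports) and \
               not any(prev <= cset for prev in found):
                found.append(frozenset(cset))
    return found

min_sets = minimal_sets(supports)
```

For this toy data the two offending pairs differ in coordinates $\{x_1\}$ and $\{x_2\}$ respectively, so the unique minimal set is $\{x_1, x_2\}$: every function fitting the data must involve both variables.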
As the above example demonstrates, there are typically many minimal sets for a node. Next we introduce a method to score minimal sets, based on the sparseness assumption.
Let $\mathcal{F}$ be the set of minimal sets for a node $j$. Define $W_s$ to be the number of sets in $\mathcal{F}$ of cardinality $s$ and $W_s(a)$ to be the number of sets $X$ in $\mathcal{F}$ of cardinality $s$ such that $a \in X$. Then we score each variable $a$ with a function $S(a)$ built from the ratios $W_s(a)/W_s$, which rewards variables appearing in many small minimal sets, and each set $X$ with a score $T(X)$ obtained by combining the scores $S(a)$ of its variables $a \in X$.
We can obtain a probability distribution on $\mathcal{F}$ by dividing the set scores by their sum. The best-performing minimal set is the one with the highest normalized set score.
For other scoring strategies, see [18]. Note that the strategy presented here corresponds to the method in that paper.
Recall from Section 3.5 that Gröbner bases depend on the choice of monomial ordering and so do resulting models. In order to avoid ambiguity in model selection, we would like to be able to (1) identify all monomial orderings that cause the Buchberger algorithm to produce different reduced Gröbner bases; (2) study the effect of these monomial orderings on the model dynamics; and (3) select the best model according to some criteria.
A combinatorial structure that contains information about all reduced Gröbner bases of an ideal, and thus fulfills item (1) from the above list, is the Gröbner fan of an ideal. It is a polyhedral complex of cones, each corresponding to an initial ideal, which is in a one-to-one correspondence with the marked reduced Gröbner bases of the ideal, where “marked” refers to the initial term of each generating polynomial being distinguished. Here we give a brief introduction to the concept of the Gröbner fan. For details see, for example, [29].
A fixed polynomial ideal has only a finite number of different reduced Gröbner bases. Informally, the reason is that most monomial orderings differ only in high degree, and the Buchberger algorithm for Gröbner basis computation does not “see” the difference among them. The reduced Gröbner bases may, however, vary greatly in the number and shape of their polynomials. In order to classify them, we first present a convenient way to define monomial orderings using vectors. Again, we think of a polynomial in $k[x_1, \ldots, x_n]$ as a linear combination of monomials of the form $x^\alpha$ over a field $k$, where $\alpha$ is the $n$-tuple exponent $(\alpha_1, \ldots, \alpha_n)$.
We are now almost ready to give a definition of the Gröbner fan. We first define its building blocks, the Gröbner cones.
The collection of all cones for a given ideal is the Gröbner fan of that ideal. The cones are in bijection with the marked reduced Gröbner bases of the ideal. Since reducing a polynomial modulo an ideal I, as the reverse engineering algorithm requires, can have at most as many outputs as the number of marked reduced Gröbner bases, it follows that the Gröbner fan contains information about all Gröbner bases (and thus all monomial orderings) that need to be considered in the process of model selection. For an example of utilizing the information encoded in the Gröbner fan of an ideal for reverse engineering of PDSs, see [30].
There are algorithms based on the Gröbner fan that enumerate all marked reduced Gröbner bases of a polynomial ideal. An excellent implementation of such an algorithm is the software package Gfan [31].
The software package Gfan, among others, implements a special type of Gröbner basis conversion known as the Gröbner walk [32]. Unless the dimension of the fan is low or the number of its cones happens to be small, however, computing the entire fan is computationally expensive, and some of its components do not have polynomial bounds. The following example illustrates how infeasible computing the entire Gröbner fan can be.
Fortunately, in order to compute polynomial normal forms, the only information that we need to extract from the Gröbner cones of a fan is their corresponding reduced Gröbner bases and/or their relative volumes, where “relative” refers to the cone volume when bounded by an n-sphere centered at the cone’s vertex. While computing the exact relative cone volumes requires knowing the facets of the fan, that is, the fan itself, approximation of the relative volumes in many cases is sufficient [33]. A stochastic method for estimating the relative volumes of the Gröbner cones of a Gröbner fan without computing the actual fan, as well as a Macaulay 2 implementation for uniform sampling from the Gröbner fan, is presented in [34].
For reasons explained at the beginning of Section 3.2, we have been assuming that the experimental data we use for reverse engineering have already been discretized into a (small) finite number of states. Typically, however, experimental measurements come to us represented by computer floating point numbers and consequently data discretization is in fact part of the modeling process and can be viewed as a preprocessing step. We will use the definition of discretization presented in [35].
Without loss of generality, assume that the data vector $v = (v_1, \ldots, v_m)$ is sorted, i.e., $v_i \leq v_{i+1}$ for all $i$. Spanning discretizations of degree $D$ satisfy the additional property that the smallest element of the discretized vector is equal to 0 and the largest is equal to $D - 1$.
There is no universal way of discretizing data that works for all data sets and all purposes. Sometimes discretization is a straightforward process. For example, if a gene expression time series has a sigmoidal shape, it is reasonable to discretize it into two states, mapping the values before the rise to 0 and those after it to 1. More complicated expression profiles may be easy to discretize too, and it is often true that the human eye is the best discretization “tool,” whose ability to discern patterns cannot be reproduced by any software. Large data sets, on the other hand, do require some level of automation in the discretization process. Regardless of the particular situation, it is good practice to look at the data first and explore for any patterns that may help with the discretization before inputting the data into any discretization algorithm. Afterwards, the way you choose to discretize your data, which includes selecting the number of discrete states, should depend on the type and amount of data and the specific reason for discretization. Below we present several possible approaches, which by no means comprise a complete list.
Binary discretizations are the simplest way of discretizing data, used, for instance, for the construction of Boolean network models for gene regulatory networks [36,37]. The expression data are discretized into only two qualitative states as either present or absent. An obvious drawback of binary discretization is that labeling real-valued data according to a present/absent scheme may cause the loss of large amounts of information.
Interval discretizations divide the observed range of values into k equally sized bins, where k is user-defined. Another simple method is quantile discretization, which places an (approximately) equal number of (possibly duplicated) values in each of k bins [38]. Any method based on these two approaches suffers from problems that make it inappropriate for some data sets. Interval discretizations are very sensitive to outliers and may produce a strongly skewed range [39]. In addition, some discretization levels may not be represented at all, which causes difficulties with their interpretation as part of the state space of a discrete model.
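The two methods can be sketched as follows; the function names and example values are ours. Note how, on the example input, the single outlier skews the interval bins (three values crowd into bin 0 and state 1 is nearly empty), while the quantile bins stay balanced:

```python
def interval_discretize(values, k):
    """Split the observed range [min, max] into k equal-width bins.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Map each value to its bin index; clamp the maximum into bin k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def quantile_discretize(values, k):
    """Place roughly m/k values (by rank) into each of k bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    states = [0] * len(values)
    for rank, i in enumerate(order):
        states[i] = min(rank * k // len(values), k - 1)
    return states

data = [0, 1, 2, 10]                      # 10 is an outlier
print(interval_discretize(data, 2))       # [0, 0, 0, 1]
print(quantile_discretize(data, 2))       # [0, 0, 1, 1]
```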
Quantile discretizations, on the other hand, depend only on the ordering of the observed values and not on their relative spacing. Since the distance between data points is often the only information that comes with short time series, losing it is very undesirable. A shortcoming common to interval and quantile discretizations, as well as to most other discretization methods, is that they require the number of discrete states, k, to be user-provided.
A number of entropy-based discretization methods deserve attention. An example is Hartemink’s Information-Preserving Discretization (IPD) [35]. It relies on minimizing the loss of pairwise mutual information between each pair of real-valued vectors (variables). The mutual information between two discrete random variables X and Y with joint distribution $p(x,y)$ and marginal distributions $p(x)$ and $p(y)$ is defined as
$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}.$$
Note that if X and Y are independent, then by definition of independence $p(x,y) = p(x)\,p(y)$, so $I(X;Y) = 0$. Unfortunately, when modeling regulatory networks whose variables are, for instance, mRNA, protein, and metabolite concentrations, the joint distribution is rarely known, and it is often hard to determine whether two variables are independent.
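In practice, mutual information can be estimated empirically from two already-discretized vectors by using observed frequencies in place of the unknown true distributions; a minimal sketch (in bits, i.e., base-2 logarithm):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discretized
    vectors, using observed frequencies as joint and marginals."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Perfectly correlated binary vectors share 1 bit; independent ones share 0.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

Such empirical estimates are what an IPD-style method must preserve as the discretization is coarsened.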
Another family of discretization techniques is based on clustering [40]. One of the most often used clustering algorithms is k-means [41]. It is a non-hierarchical clustering procedure whose goal is to minimize dissimilarity between the elements within each cluster while maximizing it between elements of different clusters. Many applications of k-means clustering, such as the MultiExperiment Viewer [42], start by taking a random partition of the elements into k clusters and computing their centroids. As a consequence, a different clustering may be obtained every time the algorithm is run. Another inconvenience is that the number k of clusters to be formed has to be specified in advance. Although there are methods for choosing “the best” k, such as the one described in [43], they rely on knowledge of data properties that may not be available.
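For concreteness, here is a one-dimensional k-means discretizer (Lloyd's algorithm), where the cluster index of each value serves as its discrete state. Unlike the random initial partition described above, this sketch seeds deterministically with evenly spaced distinct values, trading the chance of a better optimum for reproducibility; it assumes k ≥ 2 and at least k distinct values:

```python
def kmeans_1d(values, k, iters=20):
    """1-D k-means (Lloyd's algorithm) with deterministic seeding.
    Returns the cluster index of each value as its discrete state."""
    vs = sorted(set(values))
    # Deterministic seeds: k evenly spaced distinct values
    # (assumes k >= 2 and len(vs) >= k).
    centroids = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[j].append(v)
        # Update step: recompute centroids (keep old one if cluster empties).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

print(kmeans_1d([0.1, 0.15, 0.2, 5.0, 5.1], 2))  # [0, 0, 0, 1, 1]
```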
Another method is single-link clustering (SLC) with the Euclidean distance function. SLC is a divisive (top-down) hierarchical clustering that defines the distance between two clusters as the minimal distance between any two objects belonging to different clusters [40]. In the context of discretization, these objects are the real-valued entries of the vector to be discretized, and the distance between two vector entries v and w is the one-dimensional Euclidean distance $|v - w|$. Top-down clustering algorithms start from the entire data set and iteratively split it until either the degree of similarity reaches a certain threshold or every group consists of one object only. For the purpose of data analysis, it is impractical to let the clustering algorithm produce clusters containing only one real value. The iteration at which the algorithm is terminated is crucial since it determines the degree of the discretization. SLC with the Euclidean distance function has one major advantage: very little starting information is needed (only the distances between points). In addition, being a hierarchical clustering procedure, it lends itself to adjustment in case clusters need to be split or merged. It may, however, result in a discretization where most of the points are clustered into a single partition if they happen to be relatively close to one another. Another problem with SLC is that its direct implementation takes D, the desired number of discrete states, as an input. However, we would like to choose D as small as possible without losing information about the system dynamics and the correlation between the variables, so an essentially arbitrary choice is unsatisfactory.
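In one dimension, single-link clustering has a convenient closed form: the single-linkage merge tree of the points follows the sorted order, so splitting into D clusters amounts to cutting the D−1 largest gaps between consecutive sorted values. A sketch (with D supplied as input, the very requirement criticized above; the function name is ours):

```python
def slc_discretize(values, d):
    """Single-link clustering of a 1-D vector into d clusters.
    In 1-D this reduces to cutting the d-1 largest gaps between
    consecutive sorted distinct values."""
    vs = sorted(set(values))
    # Gap positions, ordered from widest to narrowest.
    gaps = sorted(range(1, len(vs)), key=lambda i: vs[i] - vs[i - 1],
                  reverse=True)
    cuts = sorted(gaps[:d - 1])          # positions where clusters split
    # State of each distinct value = number of cuts to its left.
    state, s = {}, 0
    for i, v in enumerate(vs):
        if i in cuts:
            s += 1
        state[v] = s
    return [state[v] for v in values]

print(slc_discretize([0.1, 0.2, 3.5, 4.0], 2))  # [0, 0, 1, 1]
```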
In [44], a hybrid method for discretizing short time series of experimental data into a finite number of states was introduced. It is a modification of the SLC algorithm: it begins by discretizing a vector in the same way as SLC does, but instead of providing D as part of the input, the algorithm contains termination criteria that determine the appropriate value of D. After that, each discrete state is checked for information content, and if this content can be considerably increased by further discretization, the state is split into two states in a way that may not be consistent with SLC.
If more than one vector is to be discretized, the algorithm discretizes each vector independently (for details on multiple-vector discretization, see [44]). If the vector contains m distinct entries, a complete weighted graph on m vertices is constructed, where each vertex represents an entry and each edge weight is the Euclidean distance between its endpoints. The discretization process starts by deleting the edge(s) of highest weight until the graph becomes disconnected. If more than one edge carries the current highest weight, all edges of this weight are deleted. The order in which the edges are removed leads to components in which the distance between any two vertices is smaller than the distance between any two components, a requirement of SLC. We define the distance between two components G and H to be $d(G,H) = \min\{|v - w| : v \in G,\ w \in H\}$. The output of the algorithm is a discretization of the vector in which each component corresponds to a discrete state, and the vector entries belonging to one component are discretized into the same state.
Having disconnected the graph, the next task is to determine if the obtained degree of discretization is sufficient; if not, the components need to be further disconnected in a similar manner to obtain a finer discretization. A component is further disconnected if, and only if, both (1) and (2) below hold:
1. The minimum vertex degree of the component is less than the number of its vertices minus 1. The contrary would imply that the component is a complete graph on its own, i.e., the distance between its minimum and maximum vertices is smaller than the distance between the component and any other component.
2. One of the following three conditions is satisfied (“disconnect further” criteria):
a. The average edge weight of the component is greater than half the average edge weight of the complete graph.
b. The distance between its smallest and largest vertices is greater than or equal to half this distance in the complete graph. (For the complete graph, this distance is the graph’s highest edge weight.)
c. Finally, if the above two conditions fail, a third one is applied: disconnect the component if doing so leads to a substantial increase in the information content carried by the discretized vector.
The result of applying only the first two criteria is analogous to SLC, with the important property that the algorithm chooses the appropriate level at which to terminate. Applying the third condition, the information-measure criterion, may, however, result in a clustering that is inconsistent with any iteration of SLC.
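The procedure above can be sketched as follows, under stated assumptions: only criteria (2a) and (2b) are implemented (the completeness check (1) and the information-content criterion (2c) are omitted), heaviest-edge deletion on a one-dimensional complete graph is realized as splitting a component at its largest gap, and the input is assumed to contain at least two distinct values. The function names are ours, not from [44]:

```python
def hybrid_discretize(values):
    """Simplified SLC-based discretization: recursively split a
    component at its largest gap while criterion (2a) or (2b) says
    the component is still too spread out. Assumes at least two
    distinct input values."""
    vs = sorted(set(values))

    def avg_edge_weight(pts):
        # Average pairwise distance = average edge weight of the
        # complete graph on these points.
        n = len(pts)
        return sum(b - a for i, a in enumerate(pts)
                   for b in pts[i + 1:]) / (n * (n - 1) / 2)

    global_avg = avg_edge_weight(vs)
    global_range = vs[-1] - vs[0]

    def split(pts):
        if len(pts) < 2:
            return [pts]
        # (2a): average edge weight above half the global average, or
        # (2b): range at least half the global range -> disconnect.
        if not (avg_edge_weight(pts) > global_avg / 2
                or pts[-1] - pts[0] >= global_range / 2):
            return [pts]
        gaps = [pts[i + 1] - pts[i] for i in range(len(pts) - 1)]
        cut = gaps.index(max(gaps)) + 1   # heaviest-edge disconnection
        return split(pts[:cut]) + split(pts[cut:])

    state = {v: s for s, comp in enumerate(split(vs)) for v in comp}
    return [state[v] for v in values]

# The number of states is chosen by the termination criteria, not the user.
print(hybrid_discretize([0.1, 0.2, 3.5, 4.0, 4.1]))  # [0, 0, 1, 1, 1]
```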
1. Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol. 1961;3(3):318–356.
2. Setty Y, Mayo A, Surette M, Alon U. Detailed map of a cis-regulatory input function. PNAS. 2003;100(13):7702–7707.
3. Mayo A, Setty Y, Shavit S, Zaslaver A, Alon U. Plasticity of the cis-regulatory input function of a gene. PLoS Biology. 2006;4(4):0555–0561.
4. Stigler B, Veliz-Cuba A. Network topology as a driver of bistability in the lac operon. J Comput Biol. 2008;18(6):783–794.
5. de Jong H. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 2002;9(1):67–103.
6. Bornholdt S. Less is more in modeling large genetic networks. Science. 2005;310:449–450.
7. Brazhnik P, de la Fuente A, Mendes P. Gene networks: how to put the function in genomics. Trends Biotechnol. 2002;20(11):467–472.
8. Gardner T, Faith J. Reverse-engineering transcription control networks. Phys Life Rev. 2005;2:65–88.
9. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D. How to infer gene networks from expression profiles. Mol Sys Biol. 2007;3(78):1–10.
10. Burda Z, Krzywicki A, Martin OC, Zagorski M. Distribution of essential interactions in model gene regulatory networks under mutation-selection balance. Phys Rev E. 2010;82.
11. Yeh H-Y, Cheng S-W, Lin Y-C, Yeh C-Y, Lin S-F, Soo V-W. Identifying significant genetic regulatory networks in the prostate cancer from microarray data based on transcription factor analysis and conditional independency. BMC Med Genom. 2009;2.
12. Lidl R, Niederreiter H. Finite fields. Encyclopedia of Mathematics and its Applications, vol. 20. Cambridge University Press; 1997. p. 369.
13. Strogatz S. Non-linear dynamics and chaos: with applications to physics, biology, chemistry and engineering. Perseus Books 2000.
14. Termanini A, Tieri P, Franceschi C. Encoding the states of interacting proteins to facilitate biological pathways reconstruction. Biol Direct. 2010;5.
15. MacLean D, Studholme DJ. A boolean model of the pseudomonas syringae hrp regulon predicts a tightly regulated system. PLoS ONE. 2010;5.
16. Colon-Reyes O, Laubenbacher R, Pareigis B. Boolean monomial dynamical systems. Ann Comb. 2004;8:425–439.
17. Veliz-Cuba A, Jarrah A, Laubenbacher R. Polynomial algebra of discrete models in systems biology. Bioinformatics. 2010;26(13):1637–1643.
18. Jarrah A, Laubenbacher R, Stigler B, Stillman M. Reverse-engineering of polynomial dynamical systems. Adv Appl Math. 2007;39(4):477–489.
19. Cox D, Little J, O’Shea D. Ideals, varieties, and algorithms. New York: Springer-Verlag; 1997.
20. Adams WW, Loustaunau P. An introduction to Gröbner bases. Graduate Studies in Mathematics, vol. 3. American Mathematical Society; 1994.
21. Eisenbud D. Commutative algebra with a view toward algebraic geometry. Graduate Texts in Mathematics. New York: Springer; 1995.
22. Buchberger B. An algorithm for finding a basis for the residue class ring of a zero-dimensional polynomial ideal. PhD Thesis, University of Innsbruck; 1965.
23. Hironaka H. Resolution of singularities of an algebraic variety over a field of characteristic zero: I. Ann Math. 1964;79(1):109–203.
24. Buchberger B, Möller HM. The construction of multivariate polynomials with preassigned zeros. In: Calmet J, editor. Proceedings of the European computer algebra conference EURO CAM’82. LNCS, vol. 144. Springer; 1982. p. 24–31.
25. Laubenbacher R, Stigler B. A computational algebra approach to the reverse engineering of gene regulatory networks. J Theor Biol. 2004;229:523–537.
26. Grayson D, Stillman M. Macaulay 2, a software system for research in algebraic geometry; 2006. http://www.math.uiuc.edu/Macaulay2.
27. Gustafsson M, Hornquist M, Lombardi A. Constructing and analyzing a large-scale gene-to-gene regulatory network: Lasso-constrained inference and biological validation. IEEE/ACM Trans Comput Biol Bioinform. 2005;2(3):254–261.
28. Thorsson V, Hornquist M, Siegel A, Hood L. Reverse engineering galactose regulation in yeast through model selection. Stat Appl Genet Mol Biol. 2005;4(1):1–22.
29. Mora T, Robbiano L. Gröbner fan of an ideal. J Symb Comput. 1988;6(2/3):183–208.
30. Dimitrova ES, Jarrah AS, Laubenbacher R, Stigler B. A Gröbner fan method for biochemical network modeling. In: Proceedings of the international symposium on symbolic and algebraic computation (ISSAC). ACM; 2007. p. 122–126.
31. Fukuda K, Anders J, Thomas R. Computing Gröbner fans. Math Comput. 2007;76(260):2189–2212.
32. Collart S, Kalkbrener M, Mall D. Converting bases with the Gröbner walk. J Symb Comput. 1997;24:465–469.
33. Dimitrova ES, Garcia-Puente LD, Hinkelmann F, et al. Parameter estimation for Boolean models of biological networks. J Theor Comput Sci. 2011;412:2816–2826.
34. Dimitrova ES. Estimating the relative volumes of the cones in a Gröbner fan. Math Comput Sci. 2010;3(4):457–466.
35. Hartemink A. Principled computational methods for the validation and discovery of genetic regulatory networks. PhD Dissertation, Massachusetts Institute of Technology; 2001.
36. Kauffman SA. Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol. 1969;22:437–467.
37. Albert R, Othmer H. The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J Theor Biol. 2003;223:1–18.
38. Dougherty J, Kohavi R, Sahami M. Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell S, editors. Machine learning: proceedings of the 12th International Conference; 1995.
39. Catlett J. Megainduction: machine learning on very large databases. PhD Dissertation. University of Sydney. 1991.
40. Jain A, Dubes R. Algorithms for clustering data. Prentice Hall; 1988.
41. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium of mathematical statistics and probability, vol. 1; 1967. p. 281–97.
42. Saeed A, Sharov V, White J, et al. TM4: a free, open-source system for microarray data management and analysis. BioTechniques. 2003;34(2):374–378.
43. Crescenzi M, Giuliani A. The main biological determinants of tumor line taxonomy elucidated by principal component analysis of microarray data. FEBS Lett. 2001;507:114–118.
44. Dimitrova ES, McGee J, Laubenbacher R, Vera Licona P. Discretization of time series data. J Comput Biol. 2010;17(6):853–868.
1If we overlook the distinction between genes and their alleles and the fact that some organelles are self-replicating and are not coded for by the organism’s DNA.