Marco S. Nobile
Dipartimento di Informatica, Sistemistica e Comunicazione,
Università degli Studi di Milano-Bicocca, Milano, Italy
SYSBIO Centre for Systems Biology, Milano, Italy
Davide Cipolla
Dipartimento di Informatica, Sistemistica e Comunicazione,
Università degli Studi di Milano-Bicocca, Milano, Italy
Paolo Cazzaniga
Dipartimento di Scienze Umane e Sociali, Università degli Studi di Bergamo,
Bergamo, Italy
SYSBIO Centre for Systems Biology, Milano, Italy
Daniela Besozzi
Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy
SYSBIO Centre for Systems Biology, Milano, Italy
The goal of synthetic biology (SB) is to design and construct novel biological circuits—in particular, gene regulatory networks (GRNs)—that are able to reproduce a desired behavior. This task is similar to the reverse engineering (RE) problem [10], whose purpose is to identify the network of interactions among the components of a real biological system, that fits an experimentally observed dynamics. The main difference between RE and the design of a synthetic circuit is that, in RE, the target dynamics is not a specifically chosen behavior, but it usually consists in laboratory measurements of some chemical species. In this context, mathematical models and computational analysis of GRNs are needed to facilitate the experimental research and to provide useful insights for the control of gene interactions.
GRNs are traditionally modeled by means of high-level formalisms, in which the interaction of genes is expressed in terms of promotion and inhibition mechanisms. An example is represented by S-systems [65], namely, systems of nonlinear ordinary differential equations (ODEs) in which gene expression is modeled by power-law functions. S-systems models of GRNs are able to capture the intrinsic nonlinearity of gene expression, providing a good description of the behavior of the corresponding artificial gene regulation system [53]. In the context of SB, the inference of S-systems was tackled by means of evolutionary computation (EC) [3, 9, 26, 53], where a population of candidate solutions iteratively evolves under the pressure of a fitness function. The EC technique traditionally applied to this problem is genetic programming (GP) [31], in which candidate GRNs are encoded by complex recursive data structures like derivation trees or LISP s-expressions. Another EC methodology that was exploited for the inference of S-systems is differential evolution [66]; however, this method suffers from the additional problem related to overfitting solutions, which must be tackled during the evolution of the typical sparse structure of GRNs.
S-systems provide a valuable formalism for the modeling of GRNs, but they are not powerful enough to describe the actual mechanisms allowing a GRN to express a certain dynamics, so that these models have a low predictive capability. In order to provide biologists with additional information regarding gene regulation mechanisms, GRNs can be modeled as mechanistic reaction-based models (RBM), which describe in detail the molecular interactions among the chemical species. An advantage of using RBMs is that they can be easily exploited to analyze and predict emerging dynamics of GRNs in different conditions, thanks to several existing simulation tools [25, 50, 52].
In this chapter, we call evolutionary design (ED) the problem of automatically deriving an RBM that is able to reproduce a desired behavior, by exploiting EC methods only. The ED of RBMs can be more complicated than the ED of S-systems, because RBMs also require a proper description of the stoichiometry of reagents and products in each biochemical reaction. Furthermore, as in the case of S-systems, a correct kinetic parameterization of all reactions is needed, in order to produce a reliable simulation of the system dynamics. The identification of a plausible set of kinetic parameters, which is known as the parameter estimation (PE) problem [10], can be tackled by means of EC as well [49, 44]. Many EC algorithms were successfully applied to PE: in Ref. [13], it was empirically shown that particle swarm optimization (PSO) [29] is the most efficient algorithm, as also suggested in Ref. [4], where PSO and genetic algorithms (GAs) were compared.
In this chapter, we introduce a computational strategy to infer an RBM that specifically represents a gene regulation model (GRM) characterized by some predefined behavior. In particular, we present a two-level ED methodology, named cuGENED, which integrates two EC algorithms: Cartesian genetic programming (CGP) [43] and PSO [29]. CGP exploits individuals encoded as fixed-length vectors of integer numbers which, in contrast to standard GP, are mapped onto directed graphs rather than derivation trees [43]. The choice of CGP for the ED process is motivated by the fact that graphs are suitable for the representation of a network, and the mapping between a CGP individual and the corresponding GRM allows a direct translation of the individual into a human-comprehensible set of chemical reactions. Moreover, since CGP exploits fixed-length individuals, it does not need any explicit strategy to avoid bloating, that is, the uncontrolled growth of the size of solutions that usually occurs in GP.
cuGENED exploits CGP to derive the network of biochemical reactions which describe the genetic regulation mechanisms in a GRM. Then, PSO is exploited to estimate the kinetic parameters of the GRMs. The target of the overall evolutionary process is represented by some representative dynamics of a predefined number of genes that participate in the GRN. Eventually, the effective behavior of the inferred model has to be validated after the synthetic engineering of the genetic circuit. Such validation might be done by measuring, for example, the transcription levels of the genetic components of the circuit.
It is worth noting that, for the ED of synthetic GRNs, it is not possible to consider any network structure as target of the optimization process, so that the quality of a candidate GRM can be only evaluated by comparing its simulated dynamics against the desired target behavior. Indeed, an absolute novelty of cuGENED with respect to the state-of-the-art is that it allows 0-knowledge inference since the number of intermediate chemical species occurring in the GRM is usually unknown a priori.
In cuGENED, every generation of CGP requires the execution of PSO, which realizes the PE for each individual. Since PSO is a population-based algorithm as well, the whole methodology is computationally expensive. Nevertheless, all fitness evaluations in each iteration of the PSO are independent and can be accelerated by means of a parallel architecture. Among the existing technologies, one of the most efficient is the general-purpose GPU computing, in which the huge computational power of modern video cards is exploited for general-purpose computation. In the recent years, the adoption of graphics engines experienced a great boost in bioinformatics, systems biology and computational biology [12, 24, 57], where central processing units (CPUs) traditionally represented the standard workhorses. As a matter of fact, when several batches of simulations need to be executed—as in the case of the ED process—the necessary computing power can rapidly overtake the capabilities of standard desktop computers, therefore requiring high-performance computing solutions. To solve this problem, in this chapter, we consider cupSODA [50], a GPU-powered deterministic simulator of biochemical systems modeled by means of RBMs. cupSODA, exploited during the PE phase, is fully integrated in cuGENED and allows a strong reduction of the overall running time.
The chapter is structured as follows. In Section 6.2, we briefly present some EC methods previously used for the inference of biochemical models, using different assumptions for the description of kinetic rates. In Section 6.3, we describe the formalization of GRMs by means of mass-action-based models, and give a brief introduction of the EC techniques (CGP and PSO) used in the ED of GRNs. Then, we briefly explain the GPU computing framework exploited to speed up the optimization process. In Section 6.4, we provide a detailed description of our ED methodology to automatically derive GRMs. In Section 6.5, we present the results of the application of cuGENED for the automatic design of GRNs consisting in two and three genes. We discuss the whole methodology in Section 6.6, and in Section 6.7, we conclude with some final remarks and future developments.
To define a mathematical model of a biochemical system, the inference of the network of unknown reactions can be performed using EC. Historically, the first example was proposed in 2001 by Koza et al., who used GP to evolve a huge population (100,000 individuals) of putative biochemical systems [32]. Specifically, each individual is composed of multiple trees representing chemical reactions. In order to assess the fitness values, individuals are converted into analog electrical circuits (e.g., chemical concentrations are represented by voltages, reactions are represented by analog components) and simulated by means of an extended version of the SPICE simulation tool kit for electrical circuits [61]. Authors showed the feasibility of this methodology by reverse engineering metabolic pathways. The huge computational complexity of the method was then discussed in Ref. [33], concerning the parallelization of the computations on a 1000-nodes cluster system.
Similarly, Sugimoto et al. proposed a GP-based methodology for the RE of biochemical systems [67], which derives ODE-based models not restricted to mass-action kinetics. Indeed, their approach allows the inference of any arbitrary kinetics (e.g., Hill functions or Michaelis-Menten [46]). Because of this freedom, the method is prone to overfitting. In order to mitigate this problem, the fitness function was extended with a “parsimony” term: it is proportional to the complexity of the derivation tree and introduces a selective pressure on the population. Their results showed the successful inference of a single reaction in a biochemical system.
A similar GP-based approach was proposed by Ando et al. [3]. In this work, each individual is defined as a set of trees, which represent the right-hand side of the ODEs associated with each chemical species in the system. Differently from Ref. [67], they showed a successful RE of a metabolic pathway characterized by two reactions and three chemical species. Similar to Refs. [32, 67], this method does not allow the inference of models that are strictly based on mass-action kinetics.
A different approach for RE was proposed by Nummela and Briant [54], in which a population of individuals modeled as Petri nets [60] is evolved by means of GAs, extending the previous work by Kitagawa and Iba [30]. Nummela and Briant exploited a pseudo-first-order approximation, so that the Michaelis–Menten kinetics was simplified to a first-order rate equation. This simplification can be done thanks to the steady-state approximation on the concentrations of the intermediate complexes, and under the assumption that the equilibrium constant for the reversible formation of the complex is larger than the substrate concentration. Similar to Ref. [67], the bloating of the solutions is mitigated by introducing an additional factor in the fitness function, proportional to the number of transitions in the putative Petri net. Since the method is based on a GA, it is not able to simultaneously evolve the network of reactions and their kinetic parameters: the latter were optimized by means of stochastic hill-climbing during the fitness evaluation phase. Even though the method failed in identifying the correct target networks, it was shown to propose putative pathways whose dynamics is similar to the target dynamics. Nobile et al. proposed an alternative EC methodology for the robust evolution of populations of Petri nets: the evolutionary Petri nets (EPNs) [47]. Conceptually similar to GP, EPNs are characterized by novel genetic operators for the crossover and mutation of individuals. For the specific application to the RE of biochemical systems, though, EPNs require a separate PE methodology for the inference of kinetic parameters.
The strict decoupling of the network inference and PE is also exploited by Lenser et al. [35] and Nobile et al. [51]. In the first work, the authors exploited a custom evolutionary algorithm that separates the evolution of the network (performed with a GP-like approach) and the parameters inference (executed with an evolution strategy algorithm), showing that this methodology prevents a premature convergence. In the second work, RE was performed by means of a specific class of GP, the CGP [43], and parameters were estimated using PSO [29]. CGP allows to evolve individuals that can be represented as sets of expressions: in the case of RE, they can be interpreted as chemical reactions. Thus, this approach allows to directly evolve systems of reactions describing the biochemical network, under the assumption that all reactions have mass-action kinetics. In so doing, no arbitrary kinetics function can be introduced, avoiding the overfitting of solutions. In Ref. [51], PSO was used for PE, as it was previously shown to represent the best option to tackle this kind of problem [4, 13].
Given a biochemical system η consisting of some molecular species (e.g., genes, proteins, metabolites) and their mutual interactions, a mechanistic reaction-based model (RBM) of η can be formally defined by specifying the set of different molecular species occurring in η and the set of biochemical reactions. A reaction , μ = 1, …, M, is usually given in the form Rμ: a1S1 + ⋅⋅⋅ + aNSN → b1S1 + ⋅⋅⋅ + bNSN, where are called the stoichiometric coefficients of Rμ. The species occurring on the left-hand (right-hand) side of Rμ are called reagents (products, respectively). A numerical value is associated with Rμ, representing the rate constant of that reaction.
By considering the values of the stoichiometric coefficients of the reactions in , we can identify some particular reactions:
In this chapter, we are interested, in particular, in reaction-based gene regulation models (GRMs), which formally describe the biochemical reactions involved in gene transcription and translation. To properly define these reactions, we assume that the set of molecular species is given by the union of two disjoint sets of species, which we denote as and , for some . We assume that |Γ| |Σ|, where | · | represents the cardinality of the set.
The set Γ represents the genes that are available for the synthetic engineering of gene regulation circuits. We assume that each element γi in Γ is strictly associated with a messenger RNA (mRNA), the product of gene transcription, whose expression can be experimentally evaluated through cutting-edge technologies (e.g., microarray [20], qRT-PCR [34]). Thus, the cardinality of Γ is equivalent to the number of target time series used in cuGENED. By abuse of notation, we will identify γi either with the “gene” or with its “transcription product”, according to the context.
The set Σ represents a set of generic species, as well as their mutual chemical complexes, that are related to the processes of gene expression. These species are the actual effectors of gene regulation, and need to be included in any GRM to evaluate their influence on the emerging behavior of the circuit. An element σi ∈ Σ might represent any kind of gene product (e.g., protein) related to γi, or the molecular complex formed by the interaction of the gene product (acting as promoting/inhibiting transcription factor) with another gene γj ∈ Γ.
To better clarify the meaning of the sets Γ and Σ, we illustrate some types of reactions that formally describe the processes of gene regulation and that might be present in a GRM, along with other generic reactions mentioned above.
Reactions that represent gene expression (transcription/translation) can be written in the form γi → γi + σi. Given a gene γj ∈ Γ, its regulation by means of gene γi can be described either as γi + γj → σk or as σi + γj → σk, with i not necessarily distinct from j. Here, the species σk might represent two different elements: (1) the product of the regulation of γj (e.g., the protein encoded by γj); in this case we can simply say that σk is equal to σj; (2) the chemical complex between gene γj and its own regulator; in this case σk is a compact formalization for the species γiγj or σiγj. The inverse reactions of the type σk → σi + γj can then be used to describe the dissociation of the regulator σi from gene γj. The effective expression of a regulated gene γj—that takes place after the occurrence of some reactions of type σi + γj → σk—can be synthetically written in the form σk → σj. This reaction states that species σj, related to gene γj, is derived from species σk, which represents an intermediate molecular complex having γj as element (e.g., σk = γiγj or σk = σiγj, as mentioned above). Finally, reactions of the form σh → σg can also be used to represent the degradation, or any other generic transformation reaction (e.g., post-translational modification of a protein) of the species occurring in Σ.
The RE process of a given biological system consists in the automatic identification of the network of interactions among the molecular components of that system. In this task, a set of experimental time series measurements of some species occurring in the system can be exploited as target to drive the evolutionary inference of the network [51]. On the contrary, in this chapter, we assume that only an expected dynamical behavior of the network is given as target. Quite obviously, we assume that no specific laboratory measurements of the temporal evolution of chemicals that will constitute the circuit can be given before the circuit itself has been engineered. For this reason, in what follows, we do not formally specify the type and the unit measurements of the amount of chemicals, but we use the general term “level”. The level can indicate either the number of molecules or the concentration values that might then be actually measured by ad hoc laboratory methodologies after the construction of the circuit. For instance, if qRT-PCR is exploited to validate the functioning of the synthetically engineered circuit, then the level of the species in Γ represents the concentrations of the corresponding mRNA, measured by relative quantification with respect to the amount of some control housekeeping gene [37].
According to the law of mass-action1, given a set of biochemical reactions —each one characterized by its own kinetic constant kμ2 —for each molecular species appearing either as reagent or product in some reaction in , it is possible to derive a rate equation that describes the variation in its concentration with respect to time. In other words, any given RBM or GRM can be formalized as an equivalent system of coupled (nonlinear) first-order ODEs [71]. This formalization can be exploited to determine the temporal evolution of the network (that is, its dynamics) by means of different numerical integration methods [7]. These algorithms require as input the set of ODEs, along with the set of kinetic constants and the initial concentrations of the chemical species. To this aim, in this chapter, we exploit cupSODA [48, 50], a GPU-powered tool that automatically derives a set of ODEs from RBMs defined according to the mass-action kinetics, and then exploits the numerical integration algorithm LSODA [59] to perform the deterministic simulation of the system dynamics. In Section 6.4, we show how to exploit CGP to carry out the ED of a GRM, which is then automatically converted to an ODEs system and simulated by means of cupSODA, considering the kinetic constants that are inferred by PSO.
CGP is an EC method whose individuals are described by means of indexed graphs, rather than trees, having (sequentially numbered) nodes arranged in a Cartesian coordinate grid [43]. The genotype of each individual is represented by a sequence of node connections and functions in the grid, called Cartesian program (CP).
Formally, a CP is a 9-tuple {G, ni, no, F, FN, nr, nc, nn, l} where
The connections of a CP start from the input nodes and pass through the functional nodes, each one having a fixed number nn of input connections. The length of the genotype G is equal to nrnc(nn + 1) + no, that is, an integer number is assigned to each input connection and to the output connection of every functional node, as well as to each output node of the grid. The inter-connectivity of nodes generally exploited in CGP, and specifically considered in this chapter, is strictly feed-forward, meaning that nodes belonging to the same column of the grid cannot be connected to each other. In addition, any node can be either connected or disconnected; disconnected nodes represent non-coding genes in the genotype and are ignored in the phenotype. The phenotype of a given CP is given by the actual graph that the CP represents. In this chapter, the semantic of the phenotype corresponds to the set of reactions derived from the graph, that will constitute the GRM. An example of CP genotype and phenotype, in the context of GRMs, is given in Section 6.4.
In a CGP population, after the fitness function of each CP has been evaluated, the best candidate solutions are selected to generate the offspring by means of a mutation operator. During the iterative process, CGP evolves a population of individuals characterized by a set of expressions, each one formed by the composition of input nodes and functional nodes. The total number of possible expressions represented by a given CP is upper bounded by the number no of output nodes. If an output node is linked to a disconnected functional node, some internal nodes will not be part of any path connecting an input node to that output node, a circumstance that leads to the possible existence of different genotypes mapping to the same phenotype. These non-coding regions are important in CGP for three reasons: (1) they reduce the size of the phenotype, as stated above; (2) they reduce the effect of mutations since a mutation acting on a disconnected functional node will not contribute to the variation of the phenotype; (3) even if they are unused in the current CP, a mutation could suddenly connect some disconnected node to the rest of the graph, resulting in a possibly relevant change in the phenotype.
In this chapter, the candidate solutions evolve by using a (1+λ) ES [5], as described in Ref. [43]: all individuals are evaluated and the best one is selected as a parent for the next generation. Then, λ offspring are produced by means of random mutations, that is, random modifications of the integers which constitute the genotype of the parent individual. The proportion of genes that are mutated is determined by the mutation rate parameter ρ ∈ (0, 1). The ES methodology does not exploit any crossover mechanism.
In Section 6.4, we show how to exploit CGP to perform the ED of GRMs. In order to evaluate the GRM that each CP represents, we need to define a proper fitness function. As target of the optimization process, we exploit a set of artificial temporal data that represent the desired behavior of the synthetic gene regulation circuit. These artificial data informally correspond to the concentrations of the molecular species that might be experimentally measured after the construction of the circuit itself. Therefore, the fitness function is based on the comparison of the target series against a simulated dynamics of the model, which is generated by exploiting the cupSODA tool [48, 50] explained in Section 6.3.4. Since the GRM derived with CGP is not complete until a proper kinetic parameterization is given, the fitness evaluation of each candidate solution embeds a PE phase, which is performed by means of PSO.
PSO is a population-based meta-heuristic inspired by the collective movement of flock of birds and school of fish [29]. In PSO, a swarm of P candidate solutions (named particles) moves inside an M-dimensional Euclidean search space, cooperating in the identification of the optimal solution, whose quality is measured using a specified fitness function. The movement of the swarm is bounded inside a subspace of : if a particle is driven outside the boundaries of this subspace as a consequence of its movement, then it is relocated inside the feasible region by means of the damping mechanism described in Ref. [72].
The i-th particle of the swarm is defined by two vectors: the position of the particle and its velocity . During the iterative optimization phase, the velocity is updated as a result of two attractors: the best position ever found by the particle itself, and the best position identified by the whole swarm. These attractors are balanced by means of two settings specified by the user: the cognitive factor and the social factor . The components of both attractors are further modulated by two vectors R1 and R2 of random numbers uniformly sampled in [0, 1), which allows the particles to move stochastically and to avoid the entrapment into local minima. Moreover, to reduce the possibility of chaotic movements of particles, the change of velocity is weighted by an inertia factor . The final definition of the velocity update formula for the i-th particle is:
where ○ represents the component-wise multiplication operator between vectors. Once the velocity is updated, the new position of each particle is determined by calculating φi = φi + vi.
In this chapter, a particle corresponds to a candidate kinetic parameterization of each GRM determined by CGP. Therefore, M is equal to the number of connected outputs in the CP, that is, the total number of reactions in the GRM.
The Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model introduced by Nvidia in 2006, consisting in a framework suitable to exploit GPUs in general-purpose computational tasks (GPGPU computing). GPGPU computing represents a valuable alternative to the traditional high-performance computing infrastructures (e.g., clusters of machines) since GPUs are characterized by low costs and a reduced energy consumption, allowing the access to tera-scale computing on common workstations of mid-range price. Nevertheless, a direct porting of sequential code on the GPU is most of the times unfeasible, due to the innovative architecture and the intrinsic limitations of this technology. Therefore, a fully exploitation of the computational power and massive parallelism of the GPU is usually challenging [17].
CUDA combines the single instruction multiple data (SIMD) architecture with multi-threading, and automatically handles the conditional divergence between threads. However, this flexibility has a drawback since any divergence of the execution flow among threads results in a serialization of the execution, affecting the overall performances. Following the naming conventions used in CUDA, a C/C++ function, called kernel, is loaded from the host (the CPU) to the devices (one or more GPUs) and replicated in many copies named threads. Threads can be organized in three-dimensional structures named blocks which, in turn, are contained in three-dimensional grids (as schematized in Figure 6.1, left side). Whenever the host computer runs a kernel function, the GPU creates the corresponding grid and automatically schedules each block of threads on an available streaming multi-processor of the GPU, thus allowing a transparent scaling of performances on different devices.
GPUs are equipped with different types of memory. As described in Figure 6.1, right side, the GPU memory hierarchy consists in the global memory (accessible from all threads), the shared memory (accessible from threads of the same block), the local memory (registers and arrays, accessible from owner thread), and the constant memory (cached and not modifiable). The best performances in the execution of CUDA code can be achieved by exploiting the shared memory as much as possible. Unfortunately, the shared memory is a very limited resource (49,152 bytes for each multi-processor, since the introduction of the Fermi architecture) that brings about restrictions on the blocks’ size. On the contrary, the global memory is very large (thousands of megabytes) but suffers from high latencies. In order to mitigate this issue, starting from the Fermi architecture, the global memory has been equipped with a L2 cache (see Figure 6.2). Moreover, with the Fermi architecture, the programmer can balance 64 KB of fast on-chip memory between the shared memory and L1 cache, specifying two different configurations: 48 KB for the shared memory and 16 KB for L1 cache, or 16 KB for the shared memory and 48 for L1 cache. In addition, using the Kepler architecture (exploited in our work), a third and perfectly balanced configuration can be specified by assigning the same amount of memory (32 KB) both to the shared memory and L1 cache.
Despite the remarkable advantages concerning the computational speedup, computing with GPUs usually requires the development and the implementation of ad hoc algorithms since GPU-based programming substantially differs from CPU-based computing. As a consequence, scientific applications of GPUs might undergo the risk of remaining a niche for few specialists. To avoid such limitations, several packages and software tools were recently released (see, e.g., Refs. [24, 50, 69, 73]), so that users with no knowledge of GPUs hardware and programming can also access the high-performance computing power of graphics engines.
In this chapter, the simulations needed to execute the ED of GRMs were carried out by means of cupSODA, a GPU-powered simulator for biological systems that allows the efficient execution of high numbers of parallel deterministic simulations [50, 48]. cupSODA relies on a C version of the numerical integration algorithm LSODA [59], ported and adapted to the CUDA architecture. This tool requires an Nvidia video card and can be executed on Microsoft Windows, Linux, and Apple OS X-based operating systems. cupSODA was designed to the purpose of being a black-box simulator, that can be easily used without any programming skills. In particular, cupSODA automatically converts any mass-action-based RBM of a biological system into the corresponding system of ODEs.
cupSODA exploits the massive parallelism of CUDA architecture for the parallel execution of independent simulations—each one characterized by a different initial parameterization of the model—in each thread. cupSODA was designed to speed up the time-consuming tasks typical of computational biology [2, 10], which rely on the repetition of large numbers of simulations in perturbed conditions, generally realized by varying the initial concentrations of chemical species or the value of the kinetic constants. cupSODA takes advantage of the memory hierarchy of the GPU, by allocating the most frequently updated data (e.g., the state of the system) in the shared memory, and the unvarying data (e.g., the set of reactions in the RBM) in the constant memory.
cupSODA integrates a specific functionality for the evaluation of the fitness function since it allows to compare the outcome of the simulations with any available target data (e.g., the levels of chemical species at chosen time instants t0, …, tF). This is useful, for instance, to perform a PE task. To this aim, cupSODA invokes the LSODA kernel F − 1 times: each time the kernel is run over a (simulated) time interval of length Δt = ti − ti − 1, i = 1, …, F, and the concentration values of the output species are stored at the end of each Δt. Once the concentrations are stored, cupSODA provides a set of metrics (e.g., Equation 6.2) that can be exploited to evaluate in parallel the fitness function of each particle.
In this section, we describe our ED methodology, based on the integration of (1) CGP for the design of the synthetic circuit, (2) PSO for the estimation of the kinetic parameters of the corresponding GRM, and (3) cupSODA for the GPU-accelerated fitness evaluation.
To the best of our knowledge, this is the first time that the ED is performed by means of a hybrid approach based on CGP and PSO. The choice of PSO to perform the PE task is motivated by empirical studies in the field, highlighting its higher performances with respect to other popular global optimization techniques like GAs, differential evolution and evolution strategy [4, 13].
In cuGENED, CGP is implemented using F = { +, −} as the set of functions. The composition of input nodes and functional nodes allows CGP to encode complex expressions for each genotype G. The expressions can be then converted, by means of symbolic manipulation, into arithmetical equations as described hereafter. In the context of ED of GRMs, the set of input nodes is determined by the species occurring in Γ and Σ, so that ni = |Γ| + |Σ|. Since the exact regulatory interactions necessary to reproduce the desired behavior of the synthetic circuit are to be determined, it follows that the number of chemical species belonging to the set Σ is unknown as well. Therefore, also the cardinality of Σ must be inferred by means of some heuristics. As a matter of fact, the value |Σ| has a relevant impact on the ED process since it determines the space of the possible GRMs that CGP can explore. If |Σ| is too small, then a circuit that is able to achieve a perfect fit of the desired dynamics might be impossible to be designed. On the contrary, if |Σ| is too large, then the evolutionary algorithm can take a longer time to converge to an optimal solution, or might suffer from overfitting. Nevertheless, in the second case, CGP is able to automatically exclude the unnecessary chemical species from the set of candidate solutions if they have no impact on the fitness value. In this work, we always use the value |Σ| = 4 · |Γ| as heuristic, because we assume that, for each target species in Γ, four intermediates are enough to model gene regulation mechanisms.
The number of input connections for each functional node is nn = 2, because we only exploit addition and subtraction as functions for the construction of the expressions. The set of output nodes corresponds to the biochemical reactions that will be part of the inferred GRM. In these reactions, the species corresponding to the input nodes appear either as reactants, if a functional node containing function + is crossed, or as products, if a functional node containing function is crossed. In so doing, each expression encodes a single reaction of the GRM.
Generally, in CGP, the number of output nodes no is fixed. In the context of GRMs, this implicitly means that the actual number of reactions should be known before the optimization takes place, which is clearly unreasonable. Since the GRM itself is the goal of the optimization, in this chapter, the value no represents an upper bound to the number of reactions that will appear in the model. Hence, a proper choice for no is fundamental: it should be large enough to include all the necessary regulatory mechanisms, and small enough to avoid bloating and overfitting. As heuristic for the selection of no, we suggest to fix its value equal to the number of all possible reactions that can involve the set of genes and the set of generic chemical species, that is, no = |Γ| · |Σ| + |Γ| + |Σ|. Furthermore, in this implementation, we set l = nc, meaning that output nodes can be connected to any functional or input node, and each functional node belonging to column j can be connected only to nodes between column j − 1 and the input nodes.
An example of CP related to the ED problem of GRMs is explained in Example 6.1 and graphically represented in Figure 6.3.
■ EXAMPLE 6.1
There exists a strict correspondence between the genotype of the CP and the connectivity of each node in the Cartesian coordinate grid. Functions + and − in the grid are represented by 0 and 1 in the genotype, respectively (the first number in each of the four triplets). Since + and − are binary operators, we use nn = 2 input connections to each functional node. The grid of functional nodes consists in |FN| = nrnc = 4 nodes since the grid is composed of two rows and two columns. The genotype is generated by randomly drawing values related to the chemical species (0 for γ0, 1 for σ0, 2 for σ1) and to the output of each functional node (3, … , 6). Then, these values are assigned to the input of the functional nodes and to the output nodes (numerical values with asterisk). Note that, in the construction of the genotype, given a functional node, only the values of previous input and functional nodes can be randomly selected and assigned to its input connections since l = nc. The grid is parsed backwards, from the no = 2 output nodes R1, R2 to the ni = 3 input nodes γ0, σ0, σ1. The resulting equations are then automatically converted into an equivalent set of biochemical reactions. The grey functional nodes are involved in the equations since a path from the input nodes to the output nodes exists here. The white functional nodes, instead, are not connected in the grid and do not participate in any equation in this CP. Thus, their corresponding genes (gray numbers) are non-coding sequences.
The pseudo-code of our ED methodology based on CGP and PSO is reported in Algorithm 6.1. The evolutionary process begins with the creation of a population of I = 1 + λ random CPs (lines 2–7). The population evolves by means of an ES process which can be decomposed in three main steps:
Figure 6.3 schematizes the conversion of a CP into the corresponding GRM (Step 1). The connections of each output node are followed backwards by recursively passing through the functional nodes. This process yields a derivation tree that is translated into an arithmetical equation, where positive terms (respectively, negative) are considered as reactants (products). Each equation produces a single candidate chemical reaction for the network. A candidate GRM η for the synthetic circuit is then obtained by repeating the algorithm for all output nodes.
The fitness values of all candidate solutions are calculated at the end of the PE process by means of PSO, so that a ranking of the CPs can be assessed and the best solution (bestCP) in the CGP population is identified (Step 2).
During each generation of the CGP, the new offspring solutions are obtained by applying a mutation operator to bestCP. The GRM η is compared to the rest of the population: if it is identical to some other CP, it is mutated again in order to achieve a heterogeneous population, and the process is repeated. In this chapter, we do not verify that η consists of a single connected component since not all chemical species in Σ necessarily need to be present in the GRM. The rationale behind this choice is that, in the ED methodology presented here, we do not specify a priori the exact number of chemical species that should occur in the system (we only provide an upper bound). Therefore, in this methodology also the number of chemical species in Σ undergoes the optimization process. This approach is different from typical methods used for the RE problem. In particular, it differs from the RE method based on CGP and PSO that we previously proposed in Ref. [51]: in that case, the number of chemical species participating in the system was known a priori; therefore, it was mandatory for the candidate solutions to consist in a single connected component.
When a new population of CGP is generated, for each candidate solution we perform a consistency check (lines 24–28 in Algorithm 6.1). Specifically, we verify that each reaction in the GRM η obeys the conditions described in Section 6.3.1, together with the additional condition that no identical reactions appear in η. In particular, since we consider at most second-order reactions, we can limit the length of CP expressions and reduce the bloating by setting nc 3. In general, if a reaction R is not consistent, we remove it from the network, so that η = η∖{R}. Eventually, when a new consistent CP is produced, the individual is inserted into the new population.
To improve the convergence speed, at the end of Step 3 we apply the elitist selection by adding a non-mutated copy of the best candidate network found so far. This solution becomes the (I + 1)th individual of the population (lines 31–33 in Algorithm 6.1). When the new population is formed, we calculate the fitness value of each individual by executing a PE by means of PSO, as described hereby. This iterative process is repeated for generations. Finally, the GRM with the best fitness, along with its parameterization, is chosen as the result of the ED problem.
For the execution of the PE task, each particle in the PSO corresponds to a candidate kinetic parameterization of a GRM inferred by CGP. Namely, for each candidate GRM, we exploit a whole swarm of particles to determine the reaction constants that best fit the desired target behavior for that GRM. To this aim, the PSO adopts the following settings: P = 64 particles, randomly generated using a logarithmic distribution to better distribute the values of kinetic constants over different orders of magnitude; csoc = ccog = 2.05, as suggested in Ref. [13]; inertia linearly decremented from w = 0.9 to w = 0.4; velocity vector automatically clamped to a maximum intensity, equal to 10% of the maximum point-to-point distance in the search space; search space bounded between 10− 5 and 102 (for each kinetic parameter) with damping boundary conditions [72]. The choice of the upper bound of the search space for the kinetic constants is fundamental for the entire optimization process since a limited range of variation might exclude the global optimum. However, to perform PE of biochemical systems consisting in a small set of reactions, and characterized by molecular species with low concentrations, limiting the search space of parameters is a good practice for the following reason. Very high values of kinetic constants could lead to undesired dynamics in which chemical species are entirely consumed in the first time instants of the simulation.
The movement of the swarm during the PE phase is determined according to the fitness function. In this chapter, for each particle φi, i = 1, …, P, which encodes the kinetic parameterization of a GRM η, the fitness is defined as the normalized distance between the values of the simulated dynamics of gene expression levels in η and the desired target dynamics. Formally, given the set Γ of genes that are available for the synthetic engineering of the gene regulation circuit and the set of reactions of the GRM inferred by CGP, we denote by Xh(tc) the expected measurement at time tc of the hth chemical species in Γ and by its simulated value sampled at time tc. The value is obtained by means of cupSODA, using the kinetic constants contained in particle φi. The fitness function of particle φi is therefore given by:
where corresponds to the number of time instants that are arbitrarily sampled in the target dynamics of the GRM.
In this chapter, the PSO algorithm is halted after iterations. Then, for each GRM η, the value —that is, the fitness of the best solution g = (k1, …, kM) determined within the swarm of all candidate parameterizations of η—is taken as the fitness of the inferred GRM η. The GRM characterized by the minimum fitness value among all GRMs inferred by CGP is eventually chosen as the best solution of the whole ED problem.
Fitness evaluations are computationally expensive since multiple simulations (one for each particle) must be performed during all iterations of the PE on each candidate network inferred by CGP. More precisely, it takes fitness evaluations to perform a whole ED process. Anyway, all simulations during each PE phase are mutually independent and can be straightforwardly accelerated by means of a parallel architecture. In order to take advantage of the parallelism of modern GPUs, we exploit the GPU-accelerated simulator cupSODA [48, 50] and launch P threads, which perform the simulations and calculate the fitness functions defined in Equation 6.2. This way, the impact of fitness evaluations on the overall running time decreases to .
The rest of the methodology (CGP and PSO) is implemented using the Python language (version 2.7) and is executed in a strictly sequential fashion. PSO invokes cupSODA to assess the fitness values by means of synchronous subprocess calls.
To test the feasibility and the effectiveness of cuGENED, we performed the ED of synthetic circuits composed by two and three genes, using as target of the optimization process a desired temporal dynamics of a small set of (in silico generated) mRNA levels. The target dynamics were created from scratch to be complex enough to require the ED of GRMs possibly characterized by multiple regulation mechanisms and intermediate species.
As previously mentioned, the results of the ED process can be only evaluated by considering the desired dynamics. No target GRM can be exploited to assess the quality of the inferred network since the goal of the ED process is to determine a synthetic circuit that still has to be constructed in laboratory. Therefore, to prove the effectiveness of cuGENED, only comparisons between the expected and the simulated dynamics of the inferred GRM are presented hereby.
All the following tests were performed on a workstation with a CPU Intel Core i7-2600, with a clock frequency of 3.4 GHz, and with a GPU Nvidia GeForce GTX 590, running OS Windows 7 64 bit. The settings of the LSODA integrator used for the simulation of the dynamics of the candidate solutions are: relative error equal to 1 · 10− 10, absolute error equal to 1 · 10− 10, maximum number of integration steps equal to 10,000.
For the ED of synthetic circuits consisting of two genes, we set |Γ| = 2 and |Σ| = 8, so that the inferred GRM can contain at most ni = 10 chemical species and no = 26 reactions (according to the heuristics described in Section 6.4). The desired target dynamics of the species in Γ are described by two sigmoidal curves, characterized by different slopes. This behavior corresponds to a down-regulation of both genes, whose constitutive expression decreases in time by the action of some mutual regulatory mechanism, that has to be inferred by cuGENED.
Some preliminary tests were executed to analyze the influence of cuGENED settings on the ED process. In the first test, we investigated the impact of the parameter nr—that is, the number of rows in the grid of functional nodes—on the convergence speed (Figure 6.4). According to our results for the two-genes system under investigation, too small (nr = 8) or too large (nr = 14) values of nr yield a worse convergence of cuGENED, while the best results are achieved with nr = 10. Moreover, by analyzing the early phases of the CGP process we can observe that the higher the nr value the faster the convergence. Nevertheless, it is the intermediate value nr = 10 that achieves the best convergence speed, confirming a similar result presented in Ref. [51]. In addition, Figure 6.4 highlights a feature that is typical of methodologies embedding PSO into CGP, as previously discussed in Ref. [51]: due to the stochasticity of PSO, two different PE executions on the same network usually lead to two different kinetic parameterizations which, in turn, are likely to have two different fitness values, explaining the large fluctuations in the convergence speed plots.
As a second test, we analyzed the impact of the parameter ρ, which determines how many elements of a CP genotype are mutated during the offspring generation. Figure 6.5 shows the convergence of cuGENED with different values of ρ, highlighting that the best choice is ρ = 0.2. This setting represents the best trade-off between exploration and exploitation of the search space. Indeed, if the value of ρ is too small, CGP is not able to properly explore the search space and cannot converge to individuals characterized by a good fitness value. On the contrary, if the value of ρ is too large, the mutation easily disrupts the structure of the best individual, yielding random individuals with a large fitness value and reducing the advantage of an evolution-guided exploration of the search space.
The best setting identified for cuGENED (i.e., nr = 10 and ρ = 0.2) was then exploited to derive the GRM η2γ that achieves the best fit with the expected behavior of the two-genes system. Figure 6.6 shows the comparison between the desired target dynamics (dots) and the simulated dynamics (lines) of η2γ. This network is represented in Figure 6.7 and consists of the following reactions:
where the numerical values above the arrows correspond to the best kinetic parameterization determined by PSO.
All the reactions in η2γ are biologically consistent and contribute well to the definition of a network of plausible genetic interactions, which could be implemented by means of synthetic biology techniques. For instance, in η2γ we can observe the presence of a multi-product reaction (R6) [14] and alternative splicing reactions (R4 and R5) [40]. Furthermore, η2γ is characterized by a low fitness value, which indicates a consistent similarity with the expected target dynamics obtained by using the kinetic constants (k1, …, k8) inferred by PSO (Figure 6.6). Finally, it is worth noting that this optimal solution does not exploit all the chemical species in the set Σ: the CGP avoided the bloating and evolved a well-fitting GRM that only contains 4 out of the 8 possible species in Σ.
As a further test, we performed the ED on a system consisting of three genes, using the best settings of cuGENED identified during the preliminary tests (Section 6.5.1). For the ED of this synthetic circuit, we set |Γ| = 3 and |Σ| = 12, so that the inferred GRM can contain at most ni = 15 chemical species and no = 51 reactions (according to the heuristics described in Section 6.4).
The desired target dynamics of the species in Γ are described by two sigmoidal curves, characterized by different slopes, and a monotonically increasing curve (Figure 6.8, dots). This behavior corresponds to a down-regulation of two genes and an up-regulation of a third gene, whose underlying regulatory interplay is to be inferred by means of cuGENED.
The best GRM η3γ inferred by cuGENED is represented in Figure 6.9 and consists in the following reactions:
where the numerical values above the arrows correspond to the best kinetic parameterization determined by PSO.
The result in Figure 6.8 shows that η3γ almost perfectly fits the desired dynamics. In particular, the target and simulated dynamics for γ0 and γ1 are perfectly overlapped, while the curve of γ2 slightly diverges from the expected behavior. Nevertheless, the simulated behavior of γ2 is qualitatively and quantitatively similar to the target behavior, which means that, from a biological standpoint, the ED goal is fully achieved.
The interaction network in Figure 6.9 has two important characteristics, highlighted by the dashed lines, which represent side effects of cuGENED:
Using a GPU Nvidia GeForce GTX 590 equipped with 1024 cores, the ED of η2γ required 70136 seconds (about 19 hours), while η3γ required 72645 seconds (about 20 hours). Thus, the running time of cuGENED presumably scales less than linearly with the number of genes, making it feasible for the ED of systems characterized by a large number of genes. The increased running time is probably due to the different size of the candidate networks: the GRMs for the three-genes system are indeed characterized by a larger number of species and reactions, a circumstance that inevitably slows down the simulations. On the other hand, thanks to the use of the cupSODA deterministic simulator [48], the fitness evaluations are computed in parallel, so that the running time is not affected by the number of particles used in the PSO phase.
In order to compare the performances of cuGENED using CPU and GPU implementations of LSODA, we run 640 simulations—corresponding to 64 random parameterizations of 10 random GRMs, each one determined by a CP—which are equivalent to a single CGP iteration. As a reference CPU implementation we used Scipy’s odeint
integrator [27], based on LSODA from the FORTRAN library odepack. The running time to perform the simulations was 1.5 s on the CPU, and 0.061 s on the GPU by using cupSODA, corresponding to a 24.5 × speedup achieved by cuGENED.
cuGENED consists in the integration of two evolutionary algorithms which work at two different levels: CGP to infer the structure of a GRN, and PSO to estimate the kinetic parameters of the reactions involved in the network. To the best of our knowledge, cuGENED represents the first attempt to tackle the ED of GRMs by means of CGP combined with PSO. This methodology is also the first attempt to automatically model a GRN assuming 0-knowledge on the set of molecular species. In this context, there are some issues that are worth considering.
In cuGENED, the ED problem is formalized without any a priori information on the structure of the system, defined in terms of molecular interactions, except for a desired target dynamics of a small subset of species. On the one hand, this information is sufficient to properly reconstruct a network whose dynamics fits the target data (see, for instance, Figure 6.6). On the other hand, this information alone is usually not enough to discriminate between possibly different network topologies [11], which could all be able to reproduce the same behavior used as target of the optimization. This well known issue in ED and RE is named indistinguishability of equivalent networks [68]. To lessen this computational difficulty, a good practice should be the multiple execution of the ED process, followed by an analysis of the obtained solutions to assess their laboratory feasibility. This analysis could as well be performed by means of automatic algorithms, which exploit a knowledge base of feasible reactions specifically defined by synthetic biologists.
With respect to other existing methodologies for the inference of GRNs, an advantage of cuGENED is that, thanks to the use of CGP, it allows to exploit a simple representation of the candidate GRMs. The genotype of CGP, indeed, is a fixed-length vector of integer values, whose corresponding phenotype is a set of human-readable chemical reactions. Moreover, cuGENED requires the estimation of only M kinetic parameters associated with the reactions, where, in general, M ≪ (N2 + 1)/2. On the contrary, other existing GP methodologies generally necessitate complex parameterizations. For instance, S-tree-based methods require the estimation of 2N(N + 1) kinetic parameters [9], thus making this method hardly exploitable for systems involving many genes. Moreover, thanks to the explicit limitation to second-order reactions in GRMs, cuGENED does not generate families of candidate solutions having an identical structure but a different stoichiometry, which still represents an open issue in GP-based methodologies [39].
Another relevant strength of cuGENED is the ED of synthetic circuits formally defined as GRMs, which, in contrast to most existing approaches [8], are not based on arbitrary kinetics functions. The GRNs evolved by CGP and PSO consist in mass-action reaction-based models, which describe the biological processes in terms of simple molecular interactions, and not as reaction rate approximations based on some kind of chemical assumption. As a matter of fact, Hill functions were shown not to represent physically realistic reaction schemes [70], especially for gene expression processes [63], while the equilibrium or quasi steady-state assumptions at the basis of Michaelis–Menten constants are valid only in specific conditions [46].
A current drawback of cuGENED is that the choice of its initial setting is relevant to converge to an optimal solution since both CGP and PSO are not settings-free algorithms. We relied on common literature settings for the PSO, while we analyzed the impact of different parameter values on the performances of CGP (i.e., the mutation rate ρ and the number of rows nr). As a future improvement of cuGENED, we will investigate the relationship between these two values, the number of genes and the complexity of the desired target dynamics, in order to derive a heuristic for the automatic selection of the most appropriate settings.
cuGENED relies on the nested execution of two evolutionary algorithms. As a consequence, its overall computational cost can be relevant, being proportional to the number of generations of CGP, to the number of iterations of PSO, to their population size, to the number of reactions involved in the candidate networks, and to the time length of the simulations. The largest part of the running time is due to the simulation of the dynamics of the candidate solutions, which is a fundamental step for the fitness computation. However, the computational cost can be strongly reduced by exploiting the GPU-powered cupSODA simulator [50]. In so doing, we realized a parallel execution of the simulations and the fitness calculations of all candidate solutions, achieving a 24.5 × speedup with respect to a strictly sequential execution of the same tasks. It is worth noting that the computational time could be further reduced by executing parallel ED instances. To the best of our knowledge, cuGENED also represents the first attempt ever in exploiting GPUs to perform an accelerated ED of GRMs.
The idea of a circuit-like connectivity between biological parts was postulated for the first time in the 1960s [45]. This intuition lead to several attempts to properly formalize biological regulation systems through mathematical models [22, 23, 28, 64] and analyze the cellular pathways under investigation by exploiting electrical circuit analogies [41, 42]. This research field was favored by the progression of molecular biology and genomics, which provided several methods and the necessary knowledge to physically assemble biomolecular components. As a matter of fact, it is nowadays possible to properly evaluate the interaction among different genes through expression data coming from high-throughput techniques like microarrays [20] and real-time PCR [34]. It is also possible to engineer in vitro or in vivo customized signaling circuits [36]. The proof of concept that a computing-like behavior could be applied to biological systems was the design of the first synthetic gene networks, realized by using engineering-based methodologies [16, 19]. Lately, several network models were proposed, all of them integrating biochemical pathway information and expression data [1, 6, 38, 58].
In this context, we developed cuGENED, a computational methodology for the automatic design of GRNs characterized by a predefined dynamical behavior. In particular, cuGENED allows to derive a mechanistic model of a gene regulation system, modeled as a parameterized set of biochemical reactions. The reactions describe the processes related to gene expression, and they involve a set of molecular species (e.g., genes, mRNA) whose dynamics represent the target behavior of the optimization.
It is often the case that a gene can regulate either itself (autoregulation) or a single other gene. This is due to the fact that a generic regulatory species might be able to interact with a single component of the gene network through a unique DNA sequence. This means that all species could be extremely selective for their molecular target (i.e., each species can bind to a unique DNA sequence), so that no unknown interactions can occur between the regulators and their corresponding molecular targets [18]. Therefore, the molecular target of a regulator could be either the gene itself or another gene, for example, in a very selective feedback system. Thus, as future developments of this work, we plan to improve the ED methodology as follows:
We also plan to introduce an automatic mechanism to detect the unessential species (e.g., products that have no impact on the system dynamics), as well as the reactions that cannot take place (i.e., reactions whose reactants have zero concentration, like those highlighted in Figure 6.9). In addition, we will consider the fact that the inference process might converge to a GRM whose simulation perfectly fits the desired behavior, though some reactions appearing in the GRM might not be biologically plausible. As mentioned in Section 6.6, a solution to this problem might come from the inclusion of some knowledge domain constraints in the ED, in order to automatically remove such reactions from the candidate GRMs.
Finally, in cuGENED we have not explicitly taken into account the biological noise that is typical of gene expression [15]. Actually, some species involved in GRNs can have very low intracellular amounts (in the order of tens or a few hundreds of molecules), so that a stochastic simulation methodology could be preferable with respect to deterministic simulations to correctly reproduce any noise-induced emergent phenomena. cuGENED can be straightforwardly extended to support also stochastic simulation algorithms for the simulation of the temporal dynamics of candidate solutions [21, 52, 62]. Indeed, as a further development, we will investigate the possibility of the automatic reconstruction of networks in presence of intrinsic noise, oscillations, or multistability phenomena.