Yuji Zhang
Department of Epidemiology and Public Health, University of Maryland
School of Medicine, Baltimore, MD, USA
and
Division of Biostatistics and Bioinformatics, University of Maryland
Greenebaum Cancer Center, Baltimore, MD, USA
In all living organisms, deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein are three types of biological macromolecules that are indispensable for all biological processes. They are unbranched polymer chains, formed by the string together of monomeric building blocks drawn from a standard repertoire that is the same for all living cells. These molecules interact with each other frequently, and conditionally depend on each other to provide complex biological functions (e.g., functions of a protein are usually provided by its interactions with other proteins and genes). These molecules and their interactions compose complex networks, called gene regulatory networks (GRNs).
Gene regulatory networks are one of the most important biological networks in the bioinformatics and systems biology field. They play a vital role in almost every biological process, including cell differentiation, metabolism, cell cycle, and signal transduction. By understanding the properties and dynamics of these networks, we can shed light on the mechanisms of diseases that occur when these cellular processes are dysregulated. Analysis and inference of GRNs will also guide biologists in downstream biological experimental designs, as such inferences are more time- and cost-effective than wet lab experiment validations. In general, there are two different types of computational approaches for analysis and inference of GRNs:
This chapter is organized as follows: we will first review the biological data sources available for analysis and inference of GRNs, and then we will introduce the topological analysis approaches for GRNs. We will briefly review different types of computational approaches for GRN inference as well as our proposed approach for GRN inference by integrating prior biological knowledge. Finally, we will conclude the chapter with discussions and future works in the GRN analysis and inference field.
In this section, we describe multiple sources of biological data that have been used for GRN analysis and inference. This will help us better understand how integration of these different types of biological data brings us a more complete picture in GRN inference.
Gene expression data can be divided into two levels: mRNA level and protein level. In this chapter, we focus on the gene expression data on mRNA level, including cDNA microarrays, high-density oligonucleotide chips, reverse transcriptase polymerase chain reaction (RT-PCR), and RNA-seq.
Originally developed at Stanford University, cDNA microarrays are glass slides on which cDNA has been deposited by high-speed robotic printing [13]. They are ideally suited for expression analysis of up to 50,000 cDNA clones per array from expressed sequence tag sequencing projects (e.g., private effort at Incyte Pharmaceuticals and the public Washington University project). Measurements are carried out as differential hybridizations to minimize errors originating from cDNA spotting variability: mRNA from two different sources (e.g., control and drug treated), labeled with two different fluorescent dyes, is passed over the array at the same time. The fluorescence signal from each mRNA population is evaluated independently, and then used to calculate the treated/control expression ratio.
These chips, produced by Affymetrix [14], consist of small glass plates with thousands of short 20-mer oligonucleotide probes attached to their surface. The oligonucleotides are synthesized directly onto the surface using a combination of semiconductor-based photolithography and light-directed chemical synthesis. Due to the combinatorial nature of the process, very large numbers of mRNAs can be probed at the same time. However, manufacturing and reading of the chips requires expensive equipment. Current chips have over 65,000 different probes, with typically several probes for each mRNA.
To measure gene expression using RT-PCR, the mRNA is first reverse-transcribed into cDNA, and the cDNA is then amplified to measurable levels using PCR [15]. Using built-in calibration techniques, RT-PCR can achieve high accuracy coupled with an exceptional sensitivity of 10 molecules/10 μl and a dynamic range covering 6–8 orders of magnitude. The method requires PCR primers for all the genes of interest, and is not inherently parallel like the previous two, so automation is crucial to scale up.
Roland Somogyi used this method to measure the expression levels of 112 genes at 9 different time points during the development of rat cervical spinal cord [16], and 70 genes during development and following injury of the hippocampus.
RNA-seq, also called “whole transcriptome shotgun sequencing,” is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies [17]. Compared to hybridization-based approaches, RNA-seq has the advantages to (1) detect novel transcripts, (2) have very low background noise, (3) contain a large dynamic range of expression levels over which transcripts can be detected, (4) have higher level of reproducibility, and (5) require less RNA sample [17]. It has been applied to various organisms, including Saccharomyces cerevisiae [18], Schizosaccharomyces pombe [19], Arabidopsis thaliana [20], zebrafish [21], mouse [22], and human cells [23].
Protein–protein interactions are essential for a wide range of cellular processes and form a network of astonishing complexity. Until recently, our knowledge of this complex network was rather limited. The emergence of large-scale protein–protein interaction maps has given us new possibilities to systematically survey and study the underlying biological system. First attempts to collect protein–protein interactions on a large scale were initiated for model organisms such as S. cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans [24, 25, 26, 27, 28]. Evidently, the generated interaction maps offered a rich resource for systematic studies of molecular networks.
After these initial efforts, the focus has moved toward deciphering the human protein–protein interactions. Most currently available human interaction maps can be divided into three classes: (1) maps obtained from the literature search [29, 30, 31], (2) maps derived from interactions among orthologous proteins in other organisms [32, 33, 34], and (3) maps based on large scans using yeast two-hybrid (Y2H) assays [35, 36]. All of these different mapping strategies have their obvious advantages as well as disadvantages. For example, Y2H-based mapping approaches offer rapid screens between thousands of proteins, but might be compromised by large false-positive rates. The extent, however, of how much the resulting interaction maps are influenced by the choice of mapping strategy is less clear. Thus, it is important to critically assess and compare the quality and reliability of produced maps.
Protein–protein interaction networks are commonly represented in a graph format, with vertices corresponding to proteins and edges corresponding to protein–protein interactions [37]. The network consists of many small subnets (groups of proteins that interact with each other but not with any other protein) and one large, connected subnet comprising more than half of all interacting proteins. The volume of experimental data on protein–protein interactions is rapidly increasing, thanks to high-throughput biotechniques that are able to produce a large number of protein–protein interactions. For instance, yeast contains over 5000 proteins, and currently about 18,000 protein–protein interactions have been identified among the yeast proteins, with hundreds of labs around the world constantly adding to this list [38]. The analogous networks for mammals are expected to be much larger. For example, humans are expected to have around 12,000 proteins and about 106 interactions.
Currently, there are several different sources available for protein–DNA interactions: (1) experimental data from genome-wide location analysis (GWLA) [39], such as ChIP-chip [40] and ChIP-seq [41], (2) curated binding information in public databases, such as TRANSFAC [42], and (3) putative binding sites based on computational prediction algorithms [43]. As an in vivo study, GWLA technology is biologically most significant, but provides only the roughest information about possible binding sites. As more advanced sequencing technologies are being developed, it is expected that such approaches will overcome these limitations in the future. Curated information in public databases, on the other hand, presents a compilation of mostly in vitro studies and provides more accurate information, but at the expense of only small coverage of all intergenic regions. The third method is based on in silico predictions and provides the most detailed information on DNA-binding site locations, but contains the highest rate of false positives. As more interactions accumulate from different resources, interactions that are identified by more than one resource will be considered as high-confidence interactions. This will help us reduce the false positives arising from different resources.
The Gene Ontology (GO) Consortium [44] has developed three separate ontologies—molecular function, biological process, and cellular component—to describe the attributes of gene products, where molecular function defines what a gene product does at the biochemical level without specifying where or when the event actually occurs or its broader context; biological process describes the contribution of a gene product to a biological objective, and cellular component refers to where in the cell a gene product functions. Each GO is structured as a directed acyclic graph, wherein each term is a child of one or multiple parents, and child terms are instances or components of parent terms. For example, in Figure 3.1, the term S phase of meiotic cell cycle (GO: 0051332) is an instance of the term S phase (GO: 0051320) as well as an instance of the term interphase of meiotic cell cycle (GO: 0051328). Such information can be incorporated into GRN analysis and inference approaches to increase the inference accuracy [45, 46].
A GRN can be defined as a directed graph: the nodes represent the genes, a directed edge from one node to another indicates that the first gene codes for a transcription factor regulates the second genes, and an undirected edge between two nodes represents their interactions at protein level, which further direct these two proteins to regulate some common downstream genes [46]. The architecture of GRN can be described by means of graph features such as node degree, network diameter, and clustering coefficient. We briefly introduce these concepts, followed by a description of GRN analysis approaches at different network levels.
The degree of a node is defined as the number of edges that connect to this node. In directed networks, the number of incoming edges is called the in-degree, and the number of outgoing edges is called the out-degree. If a node has a high degree, it indicates that this node is connected to many other nodes in the network. Biological networks are not randomly organized but have a scale-free architecture with the typical power law degree distribution, that is, only a small number of nodes have a high degree while most nodes have a small degree. The GRN network, protein–protein interaction network, and metabolic network are all scale-free networks [47]. The advantage of this kind of organization is that the loss of one non-hub link is not as disruptive in scale-free networks as in random networks. In other words, scale-free networks are generally more robust. The hubs are extremely important and usually play essential roles in many biological systems [48].
The connectivity of a node is the number of its neighbors. The neighborhood connectivity of a node is defined as the average connectivity of its all neighbors. In analogy to the in- and out-degree, every node in a directed network has in- and out- connectivity.
The shortest path length between two nodes in one network is called the node distance. The network diameter is the maximum length of shortest paths between any two nodes in one network. If a network is disconnected, its diameter is the maximum of all diameters of its connected components. The distribution of shortest path length and network diameter can indicate small-world properties of the analyzed network [49]. Many biological networks, such as GRN and metabolic networks, are known to exhibit this small-world property.
There are also other important network parameters such as clustering coefficient, betweenness centrality, and stress centrality in network topological analysis; please refer to Doncheva et al. [50] for a detailed review.
In addition to the network properties described in Section 3.3, studies have also unveiled that biological networks have modular structure in most organisms [51, 52, 53, 54]. Indeed, biological processes consist of pathways that mainly act on their own and cross talk with each other under certain conditions. Therefore, it is expected that the distinct biological processes can be organized in discrete and separable modules. A module in a network can be defined in various ways [55]:
Furthermore, Shen-orr et al. [56] discovered the presence of NMs in the transcriptional network of Escherichia coli. Network motifs are the smallest building blocks in networks. They are topologically distinct regulatory interaction patterns that are present more frequently in true biological networks than in random networks. Therefore, these motifs must have a specific biological function: they are postulated to be the basic signal transduction elements, each with its own characteristic properties. Shen-orr et al. were the first to identify these NMs. Examples of NMs are the single input motif and multiple input and feed-forward loop motifs [56].
Besides network properties of GRNs, inference of GRNs is another challenging topic in the bioinformatics field. The purpose of GRN inference is to determine for all transcription factors the regulatory mechanisms they recognize, the conditions in which they are active, the regulators they cooperate with in these conditions, and their target genes in these conditions. The approach can be categorized into two groups: bottom-up and top-down approaches [57, 58]:
One approach to tackle the above underdetermined problem in top-down network inference is to integrate the multi-complementary high-throughput datasets. Transcriptional regulation is a process that needs to be understood at multiple levels of description [59, 60] (Figure 3.2), including (1) the factor–target gene interaction, in which transcription factors activated under certain conditions interact with their conserved binding site sequences, and (2) transcriptional regulation, which explains how the bindings of transcription factors to their unique recognition sites regulate the expression of specific genes. A single source of information such as gene expression data is aimed at only one level of description (i.e., transcriptional regulation level), and thus it is limited in its ability to obtain a full understanding of the entire regulatory process. Other types of information such as protein–protein interaction [61, 62] and protein–DNA interaction [40] data provide complementary constraints on the models of regulatory processes. By integrating limited but complementary data sources, we can realize a mutually consistent hypothesis bearing stronger similarity to the underlying causal structures [60]. Among the various types of high-throughput biological data available nowadays, time-course gene expression profiles and GWLA data are two complementary sets of information that can be used to infer regulatory components. Time-course gene expression data are advantageous over typical static expression profiles as time can be used to disambiguate causal interactions. GWLA data, on the other hand, provide high-throughput quantitative information about in vivo binding of transcription factors to the target regulatory regions of the DNA. Incorporation of prior biological knowledge accumulated in literature will help guide inference from the above datasets, and integration of multiple data sources offers insights into the cellular system at different levels [46].
Another way to reduce the complexity of the GRN inference problem is to decompose it into small units of commonly used network structures, called gene regulatory modules. As we introduced in this section, GRNs are made of repeated occurrences of simple patterns–NMs. Since the establishment of the first NM in E. coli [56], similar NMs have also been found in eukaryotes including yeast [63], plants, and animals [64, 65, 66], suggesting that the general structure of NMs is evolutionarily conserved. One well-known family of NMs is the feed-forward loop [67], which appears in hundreds of gene systems in E. coli [56, 68] and yeast [63, 69], as well as in other organisms [64, 65, 66, 70, 71, 72]. A comprehensive review on NM theory and experimental approaches is presented in Ref. [73]. Knowledge of the NMs to which a given transcription factor belongs facilitates the identification of downstream target gene clusters. In yeast, a GWLA was carried out for 106 transcription factors and 5 NMs were considered significant: autoregulation, feed-forward loop, single input module, multi-input module, and regulator cascade [63]. In Section 3.4, we will review commonly used models for GRN inference, followed by the introduction of our proposed computational approach integrating multi-source biological data for GRN inference.
In the last two decades, a variety of continuous or discrete, static or dynamic, and quantitative or qualitative models have been proposed for inference of GRNs. These include biochemically driven methods [74], linear models [75, 76], Boolean networks [77], fuzzy logic [78, 79], Bayesian networks [80, 81], and recurrent neural networks (RNNs) [82, 83, 84]. Chapter 2 provides a detailed description of these approaches. However, all these computational approaches described in Section 3.3 still cannot solve the underdetermined problem in GRN inference due to the typical small sample size compared to the number of genes investigated. We hypothesize that we can enhance our understanding of gene interactions in important biological processes and improve the inference accuracy of a GRN by (1) incorporating prior biological knowledge into the inference scheme, (2) integrating multiple biological data sources, and (3) decomposing the inference problem into smaller network modules. In this section, we will introduce our proposed integrative framework to tackle these challenges.
The GRN inference only based on gene expression data is inadequate because of its intrinsic complexity and noise in gene expression data. Integrating data from multiple global assays and curated databases is essential to understand the spatiotemporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published data. Integrating these complementary datasets helps infer a mutually consistent transcriptional regulatory network with strong similarity to the structure of the underlying gene regulatory modules. Decomposing the transcriptional regulatory network into a small set of recurring regulatory patterns, called NMs, facilitates the inference. Identifying NMs defined by specific transcription factors establishes the modular framework structure of a transcriptional regulatory network and allows the inference of transcription factor–target gene relationship. This section introduces a computational framework for utilizing data from multiple sources to infer transcription factor–target gene relationships on the basis of NM regulatory modules. The data include time-course gene expression profiles, molecular interaction data, and GO information.
In the proposed framework, we consider two different layers of networks in the GRN. One is the molecular interaction network that includes protein–protein interactions and protein–DNA interactions at the factor–gene binding level. The other is the functional network that incorporates the consequences of these physical interactions, such as the activation or repression of transcription. We used three types of data to reconstruct the GRN, namely protein–protein interactions derived from a collection of public databases, protein–DNA interactions from the TRANSFAC database [42], and time-course gene expression profiles. The first two data sources provided direct network information to constrain the GRN model. The gene expression profiles provided an unambiguous measurement on the causal effects of the GRN model. GO annotation describes the similarities among genes within one network, which facilitates further characterization of the relationships between genes. The goal is to discern dependencies between the gene expression patterns and the physical intermolecular interactions revealed by complementary data sources.
The framework model for GRN inference is illustrated in Figure 3.3. Three successive steps were involved in this framework as outlined in the following:
Genes with similar expression profiles were represented by a gene module to address the scalability problem in GRN inference [79]. The assumption is that a subset of genes that are related in terms of expression (co-regulated) can be grouped together by virtue of a unifying cis-regulatory element(s) associated with a common transcription factor regulating each and every member of the cluster (co-expressed) [85]. GO information was utilized to define the optimal number of clusters with respect to certain broad functional categories. Since each gene module identified from clustering analysis mainly represents one broad biological or process category as evaluated by FuncAssociate [86], the regulatory network implies that a given transcription factor is likely to be involved in the control of a group of functionally related genes [87].
To reduce the complexity of the inference problem, NMs were utilized instead of a global GRN inference. The significant NMs in the combined molecular interaction network were first established and assigned to at least one transcription factor. These associations were further used to reconstruct the regulatory modules. This step was implemented using the FANMOD tool.
For each transcription factor assigned to an NM, a RNN was trained to model a GRN that mimics the associated NM. GA generated the candidate gene modules, and Particle Swarm Optimization was used to configure the parameters of the RNN. Parameters were selected to minimize the root-mean-square error (RMSE) between the output of the RNN and the target gene module's expression pattern. The RMSE was returned to GA to produce the next generation of candidate gene modules. Optimization continued until either a pre-specified maximum number of iterations were completed or a pre-specified minimum RMSE was reached. The procedure was repeated for all transcription factors. Biological knowledge from databases was used to evaluate the predicted results.
We applied this computational framework to two biological processes: yeast cell cycle progression process [88] and human Hela cancer cell cycle [89]. We demonstrate that our method can accurately infer the underlying relationships between transcription factor and the downstream target genes by integrating multi-sources of biological data. The predictive strength of this strategy is based on the combined constraints arising from multiple biological data sources including time-course gene expression data, combined molecular interaction network data, and GO category information.
The analysis and inference of GRNs is a major obstacle in bioinformatics majorly due to (1) intrinsic complexity of gene regulation mechanisms, (2) limited sample size compared to gene numbers in one experiment, and (3) noise in gene expression data themselves. Computational approaches integrating multi-source biological data can address this challenge by reverse engineering GRNs at NM level. However, there are still needs for developing models that can integrate more types of biological data.
Biological systems are characterized by many highly interconnected levels. Most approaches analyze the GRNs at transcriptional level. This network may be augmented with additional types of relations, which can then provide further insight into other types of cellular mechanisms. One of the major tasks ahead is therefore the integration of more sources of information. One intriguing dataset to add is that of signal transduction pathways, including directed protein–protein interactions as those between kinases and their substrates, and interactions between signaling molecules (e.g., pheromone) and their targets. Assuming these data were available, it will enable the characterization of signal transduction pathways and their control mechanisms. Attempts to collect genome-wide signaling data are underway, for example, using protein chips designed to test kinase phosphorylation interactions as performed by Snyder et al. [90].
This work is supported in part by grant P30 CA 134274-04 from the NCI.