Identifying CpG Islands: Sliding Window and Hidden Markov Model Approaches
Raina Robeva*, Aaron Garrett†, James Kirkwood* and Robin Davies‡, *Department of Mathematical Sciences, Sweet Briar College, Sweet Briar, VA, USA, †Department of Mathematical, Computing, and Information Sciences, Jacksonville State University, Jacksonville, AL, USA, ‡Department of Biology, Sweet Briar College, Sweet Briar, VA, USA
The CpG dinucleotide (a cytosine followed by a guanine on a single DNA strand) has a pattern of non-homogeneity along the genome. Regions of relatively low CpG frequencies are interrupted by clusters with markedly higher C and G content, known as CpG islands (CGIs). CGIs are often associated with the promoter regions of genes, and methylation of the promoter CGI is associated with the transcriptional silence of the gene. Conversely, promoter-associated CGIs in constitutively expressed housekeeping genes are unmethylated. Appropriate methylation of CGIs is required for normal development, and inappropriate methylation of CGIs in tumor suppressor promoters has been associated with the development of numerous human cancers.
In higher multicellular organisms the genetic composition of an individual is determined by the fusion of sperm and egg nuclei following fertilization. With a few exceptions1 all cells of a multicellular organism have the same DNA sequence. However, the cells of the multicellular organism have very different patterns of gene expression and thus make very different groups of proteins. During the process of development, cells become differentiated and take on their mature pattern of gene expression. The following question is thus important: Once a tissue is produced during development, how is the tissue-specific pattern of gene expression maintained? Part of the answer lies with the production of specific proteins involved in the transcription of specific genes, part involves the histones which package the DNA, and another part of the answer lies in the chemical alteration of the DNA itself.
In complex organisms a fraction of the cytosine DNA bases may be methylated, with the degree of cytosine methylation varying considerably among fungi, plants, invertebrate and vertebrate animals [1]. Methylation of cytosine occurs on the #5 position (see Figure 9.1) and the resulting entity is called 5-methyl cytosine. If we were to compare the DNA of a pair of differentiated cells (e.g., liver cells or skin cells), we would observe that they had different patterns of methylation. Patterns of methylation are correlated with patterns of gene expression in an inverse relationship, in which silent (non-expressed) genes are methylated. In a particular differentiated cell type, the pattern of methylation is maintained through successive mitoses by the action of enzymes called maintenance methylases.
Figure 9.1 Comparison of unmethylated (left panel) and methylated (right panel) cytosines. The arrow in the left panel marks the #5 position. (The carbons and nitrogens are numbered, in this case, counter-clockwise beginning with the nitrogen on the bottom.)
In vertebrate animals, methylated cytosines occur in the dinucleotide sequence CpG. This dinucleotide is interesting in that its complement on the other strand of DNA is also CpG, and if the C on one strand is methylated, the C on the other strand is too. This state of affairs enables the pattern of DNA methylation to be perpetuated through successive rounds of replication. When a DNA sequence containing a methylated C in a CpG dinucleotide is replicated, the two daughter strands will each have the C on the template strand methylated and the C on the new strand unmethylated. DNA in this state of half-methylation is the substrate for the maintenance methylase, which will methylate the unmethylated C on the new strand and thus restore the methylation pattern of the parent DNA strand. The methylation pattern, and the pattern of gene expression, will be inherited through subsequent mitoses.
Methylation of CpG dinucleotides is required for normal embryonic development, and patterns of methylation must be established following a generalized demethylation that occurs early in embryonic development [2]. The new methylation patterns are established by de novo methylases and appear to contribute to lineage restriction during development [3,4]. In other words, when pluripotent stem cells give rise to tissue-specific stem cells with more limited differentiation potential, the promoters of a subset of genes which were formerly active in the pluripotent stem cells become methylated and transcriptionally silent [5].
Over time, both methylated and unmethylated cytosines may undergo random deamination reactions. The impact of these deamination reactions differs depending upon the methylation state of the cytosine. Methylated cytosine is deaminated to thymine, while unmethylated cytosine is deaminated to uracil (see Figures 9.2 and 9.3). Uracil bases do not belong in DNA and are recognized and removed by a repair enzyme, which restores a C in their place. The T produced by deamination of methylated cytosine represents a mismatch (as it does not pair with the G on the opposite strand), and many of these Ts may also be repaired through the action of thymine-DNA glycosylase [6], but many will escape repair and the mutation will be propagated by subsequent rounds of replication. This gives rise to a C-to-T transition and the loss of a CpG. This means that, on an evolutionary timescale, unmethylated Cs tend to be preserved and methylated Cs tend to be eliminated.
Because of this systemic depletion of methylated cytosines, vertebrates have many fewer CpG dinucleotides than would be predicted by chance. Interestingly, when the DNA of vertebrates is examined, sequence information reveals that many of the CpG dinucleotides that do remain tend to occur in CGIs. The length of these CGIs varies, but they have been reported to range from several hundred to several thousand nucleotides. CGIs are found before genes (i.e., in their promoter regions), in the coding sequences themselves, and after the coding region as well.
The cytosines in CGIs in promoters are normally unmethylated, and many of these unmethylated CGIs are found in the promoters of important housekeeping genes—the genes that must function in order to support life, including the genes for enzymes involved in aerobic respiration, transcription, and translation [7]. Since the Cs in the CpG islands are unmethylated, any deamination events would be detected and repaired, and the Cs would not be lost over evolutionary time. Since all cells, even those which give rise to sperm and eggs, must express their housekeeping genes, promoters of housekeeping genes would retain their CpG dinucleotides and appear as CGIs [7].
However, CGIs are not confined to the promoters of housekeeping genes. CGIs are found in the promoters and protein coding regions (exons) of about 40% of mammalian genes [8]. Since CGIs seem to be located at the promoter regions of many known genes, identifying CGIs may be a useful method for identifying new genes.
DNA methylation is a powerful mechanism for gene silencing. Genes which are methylated and repressed in vertebrate organisms are effectively shut off. For example, hemoglobin genes should be methylated (and silent) in skin cells but unmethylated (and actively expressed) in red blood cell precursors.
The generation of cancer requires multiple changes in gene structure and function. A single mutation is not enough to transform a normal cell into a cancerous cell; between three and seven mutations or other genetic insults are required [9,10]. The affected genes—those which are involved in the progression from normal cell to cancer cell—fall into two different categories: oncogenes and tumor suppressors. Oncogenes must be activated, and tumor suppressors must be inactivated, in order for cancer to develop. Tumor suppressors may be inactivated through mutation or through gene silencing.
Mutations of tumor suppressors were identified first. In hereditary retinoblastoma, a cancer of the retina, affected children have inherited a defective copy of the gene for the tumor suppressor Rb from one of their parents. Tumors often arise in both eyes of these children following the loss of the second copy of Rb (inherited from the other parent) [11]. Additional tumor suppressor gene mutations were discovered by examining the DNA of many different tumors and comparing that DNA to that of normal tissue. For example, the tumor suppressor p16, a cell cycle control protein, was found to be deleted or to have suffered mutations in many different types of cancer [12]. Tumor suppressor genes do not need to be mutated to contribute to the genesis of cancer, however. They merely need to be silenced. If the CGIs in the promoters of p16 or Rb genes are inappropriately methylated and the genes are turned off, then the cell in question will be one step closer to the development of a cancer.
CGIs were first characterized quantitatively 25 years ago by Gardiner-Garden and Frommer [13] and their definition is still widely in use today. The terms percent combined C and G content (%GC) and observed over expected CpG ratio (Obs/Exp CpG) introduced by the authors provide a way to identify genomic regions with higher frequencies of C and G nucleotides and CpG dinucleotides. The %GC of a sequence is the combined number of Cs and Gs in the sequence divided by the total number of nucleotides in the sequence. To define Obs/Exp CpG, we note that if dinucleotides in a DNA sequence were formed by random independent choices of two nucleotides, the expected number of CpG dinucleotides in a sequence of length l would be

Expected CpG = (number of Cs × number of Gs) / l.
The observed CpG count is the actual number of CpG dinucleotides found in the sequence of length l. The observed over expected CpG ratio (Obs/Exp CpG) is defined as the ratio of these two numbers (and, unlike the quantity %GC, may assume values greater than 1).
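These two quantities are straightforward to compute directly. The sketch below is a minimal illustration (not the code used by any of the tools discussed later) of %GC, expressed as a fraction, and the Obs/Exp CpG ratio for a DNA string:

```python
def gc_content(seq):
    """Fraction of C and G nucleotides in the sequence (%GC as a fraction)."""
    return (seq.count("C") + seq.count("G")) / len(seq)

def obs_exp_cpg(seq):
    """Observed over expected CpG ratio.

    Expected CpG count = (#C * #G) / l, the count predicted if
    dinucleotides were formed by independent choices of nucleotides.
    """
    l = len(seq)
    observed = seq.count("CG")  # a CpG is a C immediately followed by a G
    expected = seq.count("C") * seq.count("G") / l
    return observed / expected if expected > 0 else 0.0
```

For example, the sequence CGCGCG has %GC equal to 1.0 and an Obs/Exp CpG ratio of 2.0, illustrating that the ratio may exceed 1.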
In the original study published in 1987 [13], Gardiner-Garden and Frommer defined CGIs in the vertebrate genome as sequences that have: (1) length of at least 200 bp, (2) %GC ≥ 50%, and (3) Obs/Exp CpG ≥ 0.6. This definition is still commonly used today, but it serves more as a guideline since there is no universal standard for the cutoff values. For instance, Takai and Jones [8] used a more stringent criterion to analyze CGIs in human chromosomes 21 and 22: (1) length ≥ 500 bp, (2) %GC ≥ 55%, and (3) Obs/Exp CpG ≥ 0.65, motivated by reducing the number of CGIs found within Alu repeats.
Algorithms for extracting CGIs often utilize a sliding windows approach that has been implemented by many web-based software systems including CpGPlot/CpGReport [14], CpGProd [15], CpGIS [8], and CpGIE [16]. The method calculates the %GC and Obs/Exp CpG for subsequences of fixed length l that differ from one another only by 1 bp (each new subsequence is offset by 1 bp to the right from the previous one). One can visualize the process as sliding a “window” of length l along the genome. If the subsequence in the window meets the specific cutoff values for %GC and Obs/Exp CpG, it will be included in a (possibly larger) island region. The details of the specific algorithm implemented by CpGIS are shown in Figure 9.4. An animated version of a sliding windows algorithm is available in the CpG Educate suite that has been developed for this chapter and is available at http://inspired.jsu.edu/~agarrett/cpg/.
Figure 9.4 Schematics for the sliding window algorithms for CGI extraction from human genome sequences from [8]. (A) Set a 200-base window in the beginning of a contig (sequence), compute %GC content and Obs/Exp CpG ratio. Shift the window 1 bp after evaluation until the window meets the criteria of a CGI. (B) If the window meets the criteria, shift the window 200 bp and then evaluate again. (C and D) Repeat these 200 bp shifts until the window does not meet the criteria. (E) Shift the last window 1 bp toward the end until it meets the criteria. (G) Evaluate %GC content and Obs/Exp CpG ratio of the combined region. (H) If this large CGI does not meet the criteria, trim 1 bp from each side until it meets the criteria. (I) Two individual CGIs were connected if they were separated by less than 100 bp. (J) Values for Obs/Exp CpG ratio and %GC content were recalculated to remain within the criteria. Reprinted with permission from Takai, D., Jones, P. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. PNAS 2002 99 (6) 3740–3745. Copyright (2002) National Academy of Sciences, USA.
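The windowing idea can be illustrated with a much simplified sketch: it slides 1 bp at a time and merges overlapping qualifying windows, omitting the 200-bp jumps, trimming, and gap-joining steps of the actual CpGIS algorithm in Figure 9.4, so its output will generally differ from that tool's.

```python
def gc_content(seq):
    return (seq.count("C") + seq.count("G")) / len(seq)

def obs_exp_cpg(seq):
    expected = seq.count("C") * seq.count("G") / len(seq)
    return seq.count("CG") / expected if expected > 0 else 0.0

def find_cgi_windows(seq, win=200, gc_min=0.5, oe_min=0.6):
    """Return merged [start, end) regions covered by at least one
    window of length `win` meeting both cutoff criteria."""
    merged = []
    for i in range(len(seq) - win + 1):
        w = seq[i:i + win]
        if gc_content(w) >= gc_min and obs_exp_cpg(w) >= oe_min:
            if merged and i <= merged[-1][1]:
                merged[-1][1] = i + win      # extend the current island
            else:
                merged.append([i, i + win])  # start a new island
    return [tuple(r) for r in merged]
```

On a synthetic sequence of 400 bp of alternating AT, 400 bp of alternating CG, and 400 bp of alternating AT, this sketch reports a single merged region spanning the CG-rich center plus the flanking windows that still satisfy the cutoffs.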
The project Investigating Predicted Genes available online from the volume’s website as part of this chapter utilizes sliding windows software to search for the presence of CGIs in the vicinity of predicted genes. The existence of a CGI in the area of the sequence where the promoter for the predicted gene should be found would be an additional piece of evidence suggesting that the predicted gene may be, in fact, an actual gene.
Sliding windows algorithms are not based on any specific assumptions of structural mechanisms (mathematical or biological) that can explain the differences in CpG density between the island and non-island regions. As such, they do not utilize any mathematical models, theory, or specialized tools to make the questions of CGI identification more tractable. Sliding windows algorithms often differ from one another not only in the choice of the threshold values for the length, %GC, and Obs/Exp CpG ratio for a sequence to be recognized as a CGI but also in whether islands that are separated by small gaps would be merged if they still meet the cutoff criteria. In addition, some CGI searcher sites use a modified criterion for CGIs, requiring that the %GC and Obs/Exp CpG thresholds are met over an average of several windows (e.g., EMBOSS uses an average of 10 windows). For the same DNA sequences, these differences in the algorithms would generally lead to different regions identified as CGIs. More importantly, it has been shown that the sliding windows method does not guarantee an exhaustive search and that it may fail to identify all regions on the genome that meet the established criteria [17]. The use of alternative methods for CGI identification, such as Hidden Markov Models (HMMs) [18] and clustering methods [19–22], would therefore be preferable.
For the rest of this chapter we consider the HMM approach to locating CGIs. Such models are based on assumptions about the distribution of the nucleotides and dinucleotides in the genome and provide a convenient mathematical framework within which the question of locating the island regions translates into well-understood mathematical problems. We begin with some examples.
The frequencies in Tables 9.1 and 9.2 present an example of nucleotide frequencies obtained from a sequence of annotated human DNA of about 60,000 nucleotides with known locations and lengths of the islands [23]. There are notable differences in the distributions, in agreement with the expectations that island regions would have elevated GC content and higher frequencies of the CpG dinucleotide. For unannotated sequences those frequencies would be unknown and we would want to have a way of obtaining some estimates for them from the data. This may initially seem to be an impossible task. After all, how can we estimate those frequencies when the boundaries between the regions are unknown? As we will see below, there is a way to solve this problem, if we view the DNA data as generated by a known underlying mechanism, formally described by a HMM.
Table 9.1
Sample dinucleotide frequencies (from [23]). The first row represents the frequencies of the transitions from A to A, C, T, and G in island and non-island regions, and similarly for the other rows. Note that G is a lot more likely to follow C in island regions. The transition frequencies have been computed from annotated DNA as follows. If p⁺_st stands for the transition frequency (transition probability) from letter s to letter t in the island region, where s, t ∈ {A, C, G, T}, then p⁺_st is computed as the ratio c⁺_st / Σ_t′ c⁺_st′, where c⁺_st is the number of times the letter s is followed by the letter t in the annotated island regions. The transition probabilities p⁻_st for the non-island regions are computed in the same way.
Table 9.2
Sample nucleotide frequencies from [23]. The entries are computed by averaging the columns of Table 9.1 for the island (“+”) and for the non-island (“−”) regions.
The rest of the chapter is organized as follows: Section 3 contains a brief review of Markov chains and HMMs, presenting some examples, highlighting some main concepts, and introducing the notation. Readers without prior experience with Markov chains or HMMs may need to consider a more detailed introduction to these topics such as [25,26]. In Section 4 we present CGI identification methods based on HMMs. The section focuses on the questions of evaluation, decoding, and parameter estimation for HMMs. Some final comments and notes on generalizations and ongoing work involving HMMs for CGI identification are gathered in Section 5. A suite of web applications, CpG Educate, has been developed by the authors to illustrate the algorithms and is used for many of the examples and exercises in the chapter. CpG Educate is freely available at http://inspired.jsu.edu/~agarrett/cpg/.
A Markov chain is a “process” that at any particular time is in one of a finite number of “states” from a set Q. We will only consider discrete time, which means that the process can be visualized as running according to a digital timer, possibly changing states at each time step. The Markov property assumes the state of the process at the next step depends only on its present state and not on the history of how the process arrived at that state. Thus, the process has no memory beyond the “present” and the only factors that govern the process are (i) its current state, and (ii) the probabilities with which the process moves from its current state to other states.
Assume that π_t denotes the state of the process at time t, where t = 1, 2, …, l. As time goes on, the process transitions from one state to another, generating a path π = π_1 π_2 … π_l (the number l is the length of the path). If we have observed the path of the process up to time t, the formal definition of the Markov property is that

P(π_{t+1} = j | π_1, π_2, …, π_t) = P(π_{t+1} = j | π_t),

stating that the probability for the process transitioning to state j at time t + 1 does not depend on the entire “history” of the process but only on its current state π_t.
For any two states i, j ∈ Q, we consider the transition probability p_ij, defined as the probability that the process moves from state i to state j when the (discrete) time changes from t to t + 1. When i = j, p_ii is the probability that the process remains in state i at time t + 1. The transition probabilities are the same for all values of t, meaning that the transitions depend only on the state of the process and not on when the process visits that state. The transition probabilities are commonly organized in a transition matrix

P = (p_ij), i, j ∈ Q,

with Σ_j p_ij = 1 for every i ∈ Q. The initial state for the process is determined by the initial distribution μ = (μ_i), where μ_i = P(π_1 = i).
It is often convenient to introduce notation that allows the initial distribution and the ending of the sequence to be treated as transition probabilities. To accomplish this, a hypothetical “beginning” state B and an “ending” state E are introduced with the assumption that the process begins at state B at time t = 0 with probability 1 and transitions to E with probability 1 at the end of each path. The probability for transitioning into B after time 0 is zero, and the probability of transitioning out of E is zero. With these additions, each path of the Markov chain can be expanded to B π_1 π_2 … π_l E. We can append a superficial state denoted by 0 to Q and write p_0i = μ_i = P(π_1 = i | π_0 = B) for the initial distribution, for all i ∈ Q, and p_iE = 1, for all i ∈ Q (at the end of each path π_1 π_2 … π_l, the process moves to E with probability 1). In what follows we will not explicitly append the symbols B and E at the beginning and at the end of all paths but, when it is necessary, the transition probabilities will be interpreted in this generalized sense.
For any observed path π = π_1 π_2 … π_l, we apply the Markov property to compute its probability as follows:

P(π) = P(π_1) P(π_2 | π_1) ⋯ P(π_l | π_{l−1}) = μ_{π_1} p_{π_1 π_2} p_{π_2 π_3} ⋯ p_{π_{l−1} π_l}.
We can now easily compute the probability of any path of this process by multiplying the initial probability μ_{π_1} by the transition probabilities along the path.
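This computation translates directly into code. The sketch below uses illustrative two-state transition probabilities (the values are made up for the sketch, not taken from the chapter's examples):

```python
# Transition probabilities and initial distribution for a toy
# two-state Markov chain with states "F" and "U" (illustrative values).
P = {"F": {"F": 0.9, "U": 0.1},
     "U": {"F": 0.2, "U": 0.8}}
mu = {"F": 0.5, "U": 0.5}

def path_probability(path, P, mu):
    """P(pi) = mu[pi_1] * product of the transition probabilities
    along consecutive states of the path."""
    prob = mu[path[0]]
    for prev, nxt in zip(path, path[1:]):
        prob *= P[prev][nxt]
    return prob
```

For instance, the path FFU has probability 0.5 × 0.9 × 0.1 = 0.045 under these parameters.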
HMMs generalize Markov chains by assuming that the process described by the Markov chain is not readily observable (it is hidden). According to some rules, each hidden state generates (emits) a symbol and only the sequence of emitted symbols is observed. The following example, inspired by the Occasionally Dishonest Casino example by Durbin et al. [23], illustrates the idea.
The player records wins and losses from consecutive games in a sequence of Ws and Ls (e.g., WLLWL), where W stands for a win and L stands for a loss. This way, for each sequence of games, the player generates a record of wins and losses x = x_1 x_2 … x_l, with x_t ∈ {W, L} denoting the outcome of game t. The chances of winning or losing a game depend on the choice of die used for that game. We can think of each x_t as generated (emitted) by the hidden state π_t with a certain probability. Since a fair die gives the player a win with a certain probability, we will say that the fair die emits a W with probability e_F(W) and that it emits an L with probability e_F(L) = 1 − e_F(W). Similarly, the unfair die emits a W with probability e_U(W) and an L with probability e_U(L) = 1 − e_U(W). The set M = {W, L} is the set of emitted symbols for the hidden process.
The main difference between Markov chains and hidden Markov chains is that the observed sequence x cannot be mapped to a unique path of state-to-state transitions for the hidden process. Multiple hidden paths can generate x and, as our next example illustrates, they do so with different probabilities.
With this background, we can now ask the following questions: Given an observed sequence x, how do we determine the hidden sequence π that maximizes the probability of observing x? Can we estimate the parameters of the HMM (the transition matrix P and the emission probabilities) from the observed sequence x?
The main reason for considering these gambling examples is that the problem of locating CGIs is in many ways mathematically analogous and may be stated similarly. A DNA sequence x composed of the symbols A, C, T, and G may be viewed as generated by a mechanism switching between two hidden regions, analogous to the honest and dishonest dice. One is the CGI region (the “+” region), the other is the non-island region (the “−” region). The emitted symbols A, C, T, and G are the same for both regions, but the transition matrices and/or the emission probabilities will be different. We cannot observe the state-to-state transitions of the process directly, but we observe the sequence x = x_1 x_2 … x_l emitted by the process, where each x_t is a nucleotide from the set {A, C, G, T}. The questions above for the casino games are now immediately translated into the following questions: Given a long DNA sequence x, how do we determine which hidden path is the most likely to have emitted this sequence? Can we estimate the transition and the emission probabilities of the HMM from the observed sequence? To examine these questions in detail, we first need to introduce the general notation for HMMs.
In general, a HMM consists of a hidden Markov chain with a finite state space Q and a finite set of emission symbols M. The state of the hidden process at time t is denoted by π_t. The transition probabilities are p_ij, i, j ∈ Q, and the initial distribution is μ = (μ_i). The process emits a symbol from each of the hidden states that it visits. The emission probabilities are denoted by e_i(b), the probability that state i ∈ Q emits the symbol b ∈ M, where Σ_{b∈M} e_i(b) = 1 (since each of the hidden states emits exactly one symbol). The set of transition, emission, and initial probabilities forms the set of parameters of the HMM.
A convenient way to summarize the parameters of a HMM is to organize them in a table where the transition matrix, emission probabilities, and the initial distribution are given in the columns and where the rows are labeled by the hidden states. Table 9.3 contains the parameters of the HMM from Example 9.3.
Table 9.3
The parameter set for the HMM from Example 9.3 in tabular form.
Tables 9.1 and 9.2 demonstrate two different ways of viewing the quantitative differences between the “+” and the “−” regions in the genome, each one of which can be used to construct a HMM. When we look only at the nucleotide frequencies, as in Table 9.2, we can consider a HMM with a state space Q = {+, −}, where each of these states can emit a symbol from the set M = {A, C, G, T} with emission probabilities as those in Table 9.2. Assuming that the hidden process transitions between the “+” and “−” states as in Figure 9.5 (where, in this case, we identify the states U and F from the casino example with the “+” and “−” states), the parameters for the HMM will be those in Table 9.4.
Table 9.4
A set of probabilities (parameters) of a HMM for a DNA sequence where the model is only concerned with the frequencies of the individual nucleotides. The transition matrix of the HMM is under the “Transitions” heading. Each of the hidden states emits a symbol from the set M = {A, C, G, T} with emission probabilities listed under the “Emissions” heading. The hidden process is equally likely to begin in the “+” and “−” state, as stated under the “Initial Distribution” heading.
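A model of this kind is easy to simulate, which is also how the CpG Educate suite generates test data. The sketch below is in the spirit of Table 9.4, with hidden states “+” and “−” emitting nucleotides with different frequencies; all numerical values are made up for the illustration and are not the actual entries of Tables 9.2 and 9.4.

```python
import random

# Illustrative parameters: transition matrix, emission probabilities,
# and initial distribution (made-up values, not those of Table 9.4).
TRANS = {"+": {"+": 0.99, "-": 0.01},
         "-": {"+": 0.01, "-": 0.99}}
EMIT  = {"+": {"A": 0.15, "C": 0.33, "G": 0.36, "T": 0.16},
         "-": {"A": 0.27, "C": 0.24, "G": 0.25, "T": 0.24}}
INIT  = {"+": 0.5, "-": 0.5}

def simulate(length, rng=random):
    """Generate (hidden_path, emitted_sequence) from the HMM."""
    def draw(dist):
        return rng.choices(list(dist), weights=list(dist.values()))[0]
    hidden, emitted = [], []
    state = draw(INIT)
    for _ in range(length):
        hidden.append(state)
        emitted.append(draw(EMIT[state]))
        state = draw(TRANS[state])
    return "".join(hidden), "".join(emitted)
```

Simulated pairs of hidden and emitted sequences like these can later be used to test how well a decoding algorithm recovers the hidden path.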
If we want the model to incorporate information about dinucleotides, as in the case of Table 9.1, the set of emitted symbols is again M = {A, C, G, T}, but now the emission events at each step are not independent from one another. If, say, the process is in the hidden state “+,” the probability for emitting a symbol C will depend upon the symbol emitted by the previous state and whether this symbol was emitted from the “+” or from the “−” hidden state. We can thus think of each nucleotide as emitted from one of two hidden states, one for each region. That is, for each of the emission symbols b ∈ {A, C, G, T} we should have states b⁺ and b⁻ in Q, leading to a state space Q = {A⁺, C⁺, G⁺, T⁺, A⁻, C⁻, G⁻, T⁻} for the hidden process. The matrix for the transitions within the subsets of the “+” and “−” states should be close to the transition matrices in Table 9.1, but switching between the “+” and “−” subsets Q⁺ and Q⁻ of Q should also be allowed with some small probability. Table 9.5 presents this scenario.
Table 9.5
The set of parameters of a HMM describing a DNA sequence. The model incorporates information about dinucleotide frequencies. Here the block matrices P⁺ and P⁻ are the “+” and “−” transition matrices from Table 9.1. The process remains in the “+” and the “−” region with certain (high) probabilities, respectively. When switching between the “+” and “−” regions, this model assumes that all states in the new region are equally likely. For each of the emission symbols b ∈ {A, C, G, T}, the model assumes that only the states b⁺ and b⁻ from Q can emit b. The initial distribution is uniform.
The parameters in Table 9.5 present a very special case among all possible HMMs with state space Q = {A⁺, C⁺, G⁺, T⁺, A⁻, C⁻, G⁻, T⁻} and emission symbols M = {A, C, G, T}, and we have included it here since it provides a natural way to include the information from Table 9.1. Ultimately, though, the parameters of the HMM would need to be estimated from the DNA data, and it would be more appropriate to consider the model in its full generality. Thus, any transition matrix P under the Transitions heading, any emission matrix E under the Emissions heading, and any initial distribution could be used as model parameters. In [27], the authors consider such a general model and estimate the parameters from a set of 1000 DNA sequences.
We now turn to the mathematical solutions of the questions raised above and describe how they can be solved in a computationally efficient way. After discussing each problem, we provide examples and exercises with connections to CGI identification. We begin with restating the questions in the context of HMMs:
1. Decoding Problem: Given an observed path x generated by a HMM with known parameters, what is the most likely hidden path to emit x? In other words, how do we find π* = argmax_π P(π | x)? From Example 9.3 we know how to compute P(x, π) for any hidden path π, and it can be shown (see Exercise 9.6 below) that

argmax_π P(π | x) = argmax_π P(x, π).
Since there are finitely many hidden paths, one may be tempted to answer the question by computing P(x, π) for all paths π, comparing their values, and picking the largest. Such a “brute force” approach is not practical, however, since the number of all hidden paths, |Q|^l, grows exponentially with the length l of the paths. The number of all such paths is astronomical for large values of l, making the task impossible even for the fastest computers.
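To make the combinatorial explosion concrete, here is a brute-force decoder, usable only for toy sequences; it enumerates all |Q|^l hidden paths explicitly (the casino-style parameter values are illustrative, not those of the chapter's examples):

```python
from itertools import product

# Illustrative two-state casino-style HMM parameters.
P  = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}
mu = {"F": 0.5, "U": 0.5}
E  = {"F": {"W": 0.5, "L": 0.5}, "U": {"W": 0.2, "L": 0.8}}

def brute_force_decode(x, states, P, mu, E):
    """Enumerate all |Q|^len(x) hidden paths and return the most
    probable one together with its joint probability P(x, pi)."""
    best_path, best_prob = None, -1.0
    for path in product(states, repeat=len(x)):
        prob = mu[path[0]] * E[path[0]][x[0]]
        for t in range(1, len(x)):
            prob *= P[path[t - 1]][path[t]] * E[path[t]][x[t]]
        if prob > best_prob:
            best_path, best_prob = path, prob
    return "".join(best_path), best_prob
```

Already for l = 60 there are 2^60 ≈ 10^18 paths for a two-state model, which is why the recursive Viterbi algorithm of the next section is needed.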
2. Evaluation Problem: Given an observed path x, what is the probability P(x) of this path? Mathematically, this probability can be expressed as

P(x) = Σ_π P(x, π),   (9.1)

where the sum is over all possible hidden sequences π = π_1 π_2 … π_l with π_t ∈ Q and where

P(x, π) = μ_{π_1} e_{π_1}(x_1) p_{π_1 π_2} e_{π_2}(x_2) ⋯ p_{π_{l−1} π_l} e_{π_l}(x_l).   (9.2)
As with the decoding problem, the mathematical solution from Eq. (9.1) has no practical value since the number of such paths grows exponentially as the length of the path l grows.
3. Training (Learning) Problem: Given an observed sequence x or a set of observed sequences, what are the HMM parameters that make the sequence x most likely to occur? The answer to this question provides estimates for the parameters of the HMM from a data set of observed sequences.
The Viterbi algorithm is a recursive and computationally efficient method for computing the most likely state sequence for a given observed sequence (see [28] and also [29] for an interesting account of the history by Andrew Viterbi himself).
To understand the recursive step, let’s assume that for any state j ∈ Q we have somehow computed the hidden sequence of highest probability among those emitting the observations x_1 x_2 … x_{t−1} up to time t − 1 and ending in state j. That is, we assume that for each j ∈ Q, we have determined the most probable path of length t − 1 for the data x_1 x_2 … x_{t−1}, ending in state j. Denote the probability of this path by v_j(t − 1):

v_j(t − 1) = max_{π_1 … π_{t−2}} P(x_1 x_2 … x_{t−1}, π_1 π_2 … π_{t−2}, π_{t−1} = j).
Next, we can use the values v_j(t − 1) to compute the most probable path of length t ending in each of the states k ∈ Q and emitting the sequence x_1 x_2 … x_t. The probability of the most likely path of length t ending in k and emitting x_1 x_2 … x_t will be the largest among the probabilities for hidden sequences that get into state j at time t − 1 with a maximal probability, transition from j to k at time t, and emit x_t. Thus

v_k(t) = e_k(x_t) max_{j∈Q} [v_j(t − 1) p_jk].
Iterating this argument for all t = 1, 2, …, l will allow us to compute the probability of the most likely path for the data x = x_1 x_2 … x_l. To be able to recover the path itself, for each time t and for each state k ∈ Q, we keep pointers (ptr) to remember the state from which the maximal probability path ending in k came. For each t and k we record k’s predecessor ptr_t(k) = r, where r = argmax_{j∈Q} [v_j(t − 1) p_jk]. The notation ptr_t(k) = r means that the path with the highest probability ending in state k at time t came from state r at time t − 1.
Utilizing the agreement that the process always starts in the beginning state B, Viterbi’s algorithm can be summarized as follows. To find the most probable path for the observed data x = x_1 x_2 … x_l, perform the following steps:
• Initialization: set v_B(0) = 1 and v_j(0) = 0 for all states j ≠ B.
• Recursion (repeat for t = 1, 2, …, l): v_k(t) = e_k(x_t) max_{j∈Q} [v_j(t − 1) p_jk], where ptr_t(k) = argmax_{j∈Q} [v_j(t − 1) p_jk].
• Termination: the probability of the most likely path is max_{k∈Q} v_k(l).
• The maximal probability path can now be found by backtracking through the recorded pointers.
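The steps above can be sketched in code as follows. This is a minimal implementation that handles the beginning state implicitly through the initial distribution and works in log space to sidestep underflow (discussed later in this section); the casino-style parameters are illustrative.

```python
import math

# Illustrative two-state casino-style HMM parameters.
P  = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}
mu = {"F": 0.5, "U": 0.5}
E  = {"F": {"W": 0.5, "L": 0.5}, "U": {"W": 0.2, "L": 0.8}}

def viterbi(x, states, P, mu, E):
    """Return the most probable hidden path for the observed sequence x."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    # Initialization: v_k(1) = mu_k * e_k(x_1), in log space.
    v = {k: lg(mu[k]) + lg(E[k][x[0]]) for k in states}
    ptr = []
    # Recursion: v_k(t) = e_k(x_t) * max_j v_j(t-1) * p_jk.
    for sym in x[1:]:
        back, nv = {}, {}
        for k in states:
            j = max(states, key=lambda j: v[j] + lg(P[j][k]))
            back[k] = j                      # ptr_t(k): best predecessor
            nv[k] = v[j] + lg(P[j][k]) + lg(E[k][sym])
        v = nv
        ptr.append(back)
    # Termination and backtracking through the recorded pointers.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path))
```

Under these parameters a run of losses is best explained by the unfair die and a run of wins by the fair die, as the decoder confirms.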
Our next example illustrates the method.
It is convenient to visualize the Viterbi algorithm by using a trellis diagram that makes it easier to follow the process. In a trellis diagram, the columns are labeled with the values of t and the rows correspond to the states of the Markov chain (see Figure 9.6). Starting from the state B and following the arrows generates all possible paths π. The number under each node is the maximal probability for the path ending in the node and agreeing with the observed data up to that time (these are the v-values computed by the Viterbi algorithm). The solid arrow into each node is the pointer that keeps track of the origin of the maximal probability path. Locating the largest probability in the last column and backtracking using the solid lines generates the maximal probability path π*.
Figure 9.6 Trellis diagram for Example 9.5. B is the beginning state (the ending state is not pictured). The numbers under each node indicate the maximal probability for all paths ending in this node and emitting sequences in agreement with the data. The dashed arrows indicate possible transitions. For each node, the solid arrow into the node is the pointer indicating where the maximal probability path comes from.
It can be shown that the computational complexity of the Viterbi algorithm is no worse than O(|Q|² l) in time and O(|Q| l) in space, where |Q| is the number of states and l is the length of the sequence. Thus, the complexity of the algorithm is quadratic in the number of states and linear in the length of the sequence, which, although still computationally demanding, is a huge improvement over the exponential growth of the brute force approach. Since the Viterbi algorithm involves multiplying many probabilities, the results could get very small, resulting in essentially zero probabilities and causing underflow problems. To avoid this, all calculations are performed on the logarithms of the probabilities, using addition instead of multiplication.
As Example 9.5 shows, performing the Viterbi computations by hand is tedious even for very short paths. From now on, we will use the CpG Educate suite, which implements the Viterbi algorithm for HMMs corresponding to the Dishonest Casino problem from Example 9.4 and for the CGI identification problem. The applications in the CpG Educate suite can generate output from an HMM and record the hidden path that produced the emitted sequence. This simulated data can then be used to test the “accuracy” of the Viterbi algorithm by comparing the predicted states from the Viterbi decoding with the actual ones from the simulations.
When using an HMM for identifying CGIs, the Viterbi algorithm produces the predicted path through the hidden “+” and “−” states. The start of an island is where the predicted hidden sequence switches from the “−” subset to the “+” subset, and the end of the island is where the sequence switches back from the “+” to the “−” states.
In this section we outline a computationally efficient algorithm for computing P(x), the probability of an observed sequence x = x1 x2 … xL. We also discuss an alternative decoding method called posterior decoding.
The forward algorithm is similar in idea to the Viterbi algorithm. Instead of keeping track of the hidden paths of maximal probability ending in state k at time i, the forward algorithm computes the probabilities of being in state k, having observed the first i symbols of the sequence x. Denote these probabilities by f_k(i). Recursively, as in the Viterbi algorithm, we can now compute
the probability f_l(i + 1) that the process is in state l at time i + 1 with emitted sequence x1 … x_(i+1):

f_l(i + 1) = e_l(x_(i+1)) Σ_k f_k(i) a_kl.
The rationale is straightforward. For the process to be in state l at time i + 1 with emitted sequence x1 … x_(i+1), it must be in one of the states k at time i with emitted sequence x1 … x_i (with probability f_k(i)), then transition to state l at time i + 1 (with probability a_kl) and emit x_(i+1) (with probability e_l(x_(i+1))). The probability f_l(i + 1) is then the sum of the products of those probabilities over all states k.
At time t = 0, the process is in the beginning state B with probability 1 and has emitted no output sequence yet. Thus, we initialize the algorithm by f_B(0) = 1 and f_k(0) = 0 for all other states k. The complete algorithm is:
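In code, the forward recursion might look as follows. This is a sketch in a dictionary-based parameter format of our own choosing; it replaces the begin state with an initial distribution and assumes no explicit end state, so P(x) is the sum of the final forward values.

```python
def forward(obs, states, a, e, init):
    """Forward algorithm: returns the table f[i][k] and the probability P(x)."""
    # initialization: f_k(1) = init_k * e_k(x1)
    f = [{k: init[k] * e[k][obs[0]] for k in states}]
    for x in obs[1:]:
        # recursion: f_l(i+1) = e_l(x_{i+1}) * sum_k f_k(i) * a_kl
        f.append({l: e[l][x] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    px = sum(f[-1].values())  # termination: P(x) = sum over final states
    return f, px
```

For short sequences the result can be checked against a brute-force sum over all hidden paths, which the forward algorithm computes in linear time.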
Solution:
Finally, we have P(x) = Σ_k f_k(L).
For our main problem of CGI identification, the probability P(x) of a DNA sequence of L nucleotides is not of much interest by itself, but we will need it to compute the posterior probabilities P(π_i = k | x) that symbol x_i in the observed sequence was emitted from state k, which can then be used for decoding.
Since
P(π_i = k | x) = P(π_i = k, x) / P(x),    (9.3)
we need a recursive algorithm for computing P(π_i = k, x). The Markov property of the hidden process makes this possible:
P(π_i = k, x) = P(x1 … x_i, π_i = k) P(x_(i+1) … xL | x1 … x_i, π_i = k)
             = P(x1 … x_i, π_i = k) P(x_(i+1) … xL | π_i = k).    (9.4)
The first line is a direct application of the conditional probability formula: the probability that x is generated with π_i = k at time i is given as the product of the probabilities of the following events: (1) the symbols x1, …, x_i are emitted up to time i and the process is in state k at time i, and (2) conditioned upon event (1), the rest of the emitted sequence is x_(i+1), …, xL. The second line follows from the Markov property of the hidden process and restates that the probability to emit the sequence x_(i+1) … xL depends only on the state of the process at time i.
Notice that the probabilities P(x1 … x_i, π_i = k) are exactly the probabilities f_k(i) computed by the forward algorithm. Denote b_k(i) = P(x_(i+1) … xL | π_i = k). Equation (9.4) can now be re-written as
P(π_i = k, x) = f_k(i) b_k(i).    (9.5)
The probabilities b_k(i) are computed by the backward algorithm. We begin by initializing the algorithm at i = L where, since xL is the last observed symbol, the next state is the end state E (which does not emit a symbol). Thus b_k(L) = 1 for every state k, since from time L the sequence goes to the end state E with probability 1.
Once we know b_l(i + 1) for all states l, for any of the states k we can compute

b_k(i) = Σ_l a_kl e_l(x_(i+1)) b_l(i + 1).
The justification is as follows: At time i the process can transition from k to any other state l (this happens with probability a_kl), emit x_(i+1) (this happens with probability e_l(x_(i+1))), and, being in state l at time i + 1, emit the rest of the sequence x_(i+2) … xL (which happens with probability b_l(i + 1)).
Since at time i = 0 the process is in the beginning state B with probability 1, b_B(0) = P(x). This shows that the probability P(x) can also be computed from the backward algorithm as P(x) = Σ_l a_Bl e_l(x1) b_l(1). If this probability has already been computed from the forward algorithm, the backward algorithm can terminate once the values b_k(1) are computed for all states k.
The complete algorithm is:
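A Python sketch of the backward recursion follows. The dictionary-based parameter format and the 0-based list indexing (position i in the list corresponds to time i + 1 in the text) are our own conventions; the initial distribution stands in for the begin state.

```python
def backward(obs, states, a, e):
    """Backward algorithm: b[i][k] = P(obs[i+1:] emitted | state k at position i)."""
    n = len(obs)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}  # initialization: b_k(L) = 1
    for i in range(n - 2, -1, -1):
        # recursion: b_k(i) = sum_l a_kl * e_l(x_{i+1}) * b_l(i+1)
        b[i] = {k: sum(a[k][l] * e[l][obs[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b
```

As a consistency check, summing init_k · e_k(x1) · b_k at the first position reproduces P(x), which can be verified against brute-force path enumeration for short sequences.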
Combining now Eqs. (9.3) and (9.5), we obtain the following equation for the posterior probabilities:
P(π_i = k | x) = f_k(i) b_k(i) / P(x).    (9.6)
Since we use both the forward and the backward algorithms, we say that the posterior probabilities are computed by the forward-backward algorithm.
The posterior probabilities computed here can be used to complement the results from the Viterbi decoding or as a possible alternative decoding method referred to as posterior decoding. They could prove useful when there are multiple paths with probabilities close to that of the maximal probability path(s) found by Viterbi decoding, in which case it may not be justified to consider only the path(s) of maximal probability. The posterior probability gives the likelihood (based on the entire observed sequence x) that the symbol in position i has been emitted by the hidden state k. Posterior decoding can be quite useful for decoding a hidden process with two states (or two groups of states), which is exactly the case we are concerned with in this chapter. To see this, in the case of the Dishonest Casino example, given the observed sequence x of rolls, we plot the probabilities P(π_i = U | x) for each i, where U denotes the unfair-die state. The “hills” in the resulting graph indicate segments in the sequence that are likely to have been emitted from the U state. The remaining segments are more likely to have been emitted by the fair-die state F. Analytically, the decoded sequence will be
π̂_i = argmax_k P(π_i = k | x).    (9.7)
Figure 9.8 exemplifies this approach for a simulated sequence of 500 symbols. The gray areas highlight the time steps at which the unfair die was used for the simulations. In the case of only two hidden states, the condition from Eq. (9.7) amounts to a threshold of 0.5: in Figure 9.8, the positions with posterior probabilities plotted above the horizontal line at 0.5 will be predicted as U states. The overlap is not perfect, of course, but, as our next exercise shows, the separation between the states can be even better for hidden processes that switch between states only with very small probability.
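Putting the forward and backward passes together, posterior decoding by Eqs. (9.6) and (9.7) can be sketched as follows. The parameter format and the plain (non-log) arithmetic are illustrative choices; a production version would work in log space to avoid underflow.

```python
def posterior_decode(obs, states, a, e, init):
    """Forward-backward posteriors P(pi_i = k | x) and their argmax decoding."""
    n = len(obs)
    # forward pass
    f = [{k: init[k] * e[k][obs[0]] for k in states}]
    for x in obs[1:]:
        f.append({l: e[l][x] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    px = sum(f[-1].values())  # P(x)
    # backward pass, with b_k(L) = 1
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][obs[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    # Eq. (9.6): posterior = f * b / P(x); Eq. (9.7): decode by argmax
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
    path = [max(states, key=lambda k: p[k]) for p in post]
    return post, path
```

At each position the posteriors sum to 1, and with two hidden states the argmax rule is exactly the 0.5-threshold rule described above.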
Figure 9.8 A plot of the posterior probabilities of being in a state generated by the unfair die for a sequence of 500 simulated runs. The gray highlight identifies the runs obtained from the simulation by using the unfair die. The transition probabilities of the HMM are , and . The default values for the emission probabilities (chosen to match those from Example 9.4) were used.
For all computations until now, we have assumed the parameters of the HMM are known. Those parameters, however, are usually only initial estimates and, in most cases, we may not even have such estimates to begin with. In this section we discuss how to estimate the HMM parameters from the data. This process is called training (or sometimes learning). The algorithm presented here was first introduced in [31] (see also [32]). We begin with an observed sequence or a set of observed sequences for which we want to adjust the HMM parameters to ensure the best possible fit. Once a set of parameters is determined for those sequences, we apply the Viterbi algorithm or the posterior probabilities to other similar sequences for decoding. The sequences are called training sequences and we say that we train the model to best fit those sequences.
More formally, this means that given the data x and the HMM, we need to find the values of the model parameters (the transition probabilities a_kl and the emission probabilities e_k(b) for all states k, l and symbols b) that maximize the probability of the data P(x). If we use θ to refer to the whole set of parameters for the model, in this section we will sometimes write P(x | θ) to emphasize that the likelihood of the data depends on the set of parameters and on the model. The goal is to find

θ* = argmax_θ P(x | θ).
If the training sequences are annotated and we know the exact paths of the process emitting the training sequences, the estimates for the model parameters can be computed by determining the frequencies of each transition and emission. Assume that A_kl is the number of transitions from k to l in the set of training sequences and E_k(b) is the number of times the state k emitted the symbol b. Then the frequencies
a_kl = A_kl / Σ_m A_km,   e_k(b) = E_k(b) / Σ_c E_k(c)    (9.8)
are maximum likelihood estimates for the HMM [23].
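The counting estimates of Eq. (9.8) are straightforward to implement. The sketch below assumes an annotated pair of equal-length strings (the observed sequence and its hidden path) and that every state both emits at least once and has at least one outgoing transition in the training data.

```python
from collections import defaultdict

def train_annotated(seq, path, states, symbols):
    """Maximum likelihood estimates of Eq. (9.8) from an annotated sequence:
    seq[i] was emitted while the hidden process was in state path[i]."""
    A = {k: defaultdict(float) for k in states}  # transition counts A_kl
    E = {k: defaultdict(float) for k in states}  # emission counts E_k(b)
    for i, (x, k) in enumerate(zip(seq, path)):
        E[k][x] += 1
        if i + 1 < len(path):
            A[k][path[i + 1]] += 1
    # normalize the counts into frequencies (Eq. 9.8)
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e
```

For example, training on the toy pair seq = "CGATCG", path = "++--++" gives a_{++} = 2/3 and e_+(C) = 1/2.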
When the paths that generate the training data are unknown, we can no longer determine the counts A_kl and E_k(b), but they can be replaced with the expected counts for each transition/emission in the HMM. Obtaining the maximum likelihood estimates in this case can be done by an iterative process known as the Baum-Welch algorithm [31]. This method is no longer guaranteed to find a global maximum for the probability of the data, but the iterative steps will converge to a local maximum. The Baum-Welch algorithm is a special case of a more general class of methods known as Expectation Maximization (EM) algorithms. Below we outline the computational aspects of the Baum-Welch method but omit the theoretical proofs of convergence to a maximum. Those proofs, together with the general theory of EM algorithms, can be found in [33]. In the description of the Baum-Welch algorithm below we assume that the HMM is trained on a single data sequence x. We comment on the straightforward generalization to multiple sequences afterwards.
The Baum-Welch algorithm generates a sequence of approximations for the set of HMM parameters, with each new set of parameters improving the value of P(x | θ) over the previous iteration. The process terminates when two successive iterations produce the same parameter values or values that differ by less than a previously chosen tolerance.
To initialize the Baum-Welch algorithm, choose any set of model parameters, incorporating any prior information that may be available. Absent such information, a uniform or any other arbitrary distribution may be chosen. Denote the set of those initial parameters by θ⁰. To obtain improved estimates, we still want to use Eq. (9.8), but this time the counts A_kl and E_k(b) will be replaced with the expected counts computed from the data. To compute those expected counts, we first compute P(π_i = k, π_(i+1) = l | x, θ⁰), the probability that the hidden process transitions from state k to state l at time i.
The inclusion of the parameter set θ⁰ in the notation indicates that these probabilities are computed based on the initial set of parameters θ⁰. We will omit this from the notation from now on for simplicity, but all of the expressions below assume that this parameter set is used for the computations. Using conditional probabilities, we obtain
P(π_i = k, π_(i+1) = l | x) = P(π_i = k, π_(i+1) = l, x) / P(x)
  = P(x1 … x_i, π_i = k) P(π_(i+1) = l, x_(i+1) … xL | x1 … x_i, π_i = k) / P(x)
  = f_k(i) P(π_(i+1) = l, x_(i+1) … xL | π_i = k) / P(x),    (9.9)
the last equation following from the Markov property.11 We have also used the notation f_k(i), introduced earlier for the forward algorithm.
The probability P(π_(i+1) = l, x_(i+1) … xL | π_i = k) can further be expressed as

a_kl e_l(x_(i+1)) b_l(i + 1),
where, as in the backward algorithm, we use b_l(i + 1) = P(x_(i+2) … xL | π_(i+1) = l). The justification is as follows: Once the process is in state k at time i, for the event to take place, the process needs to transition into l, emit the symbol x_(i+1), and generate the rest of the sequence x_(i+2) … xL. Combining this with Eq. (9.9), we obtain
P(π_i = k, π_(i+1) = l | x) = f_k(i) a_kl e_l(x_(i+1)) b_l(i + 1) / P(x).    (9.10)
The average number of transitions from k to l in the training sequence is then the sum of the probabilities of this transition occurring exactly at position i, over all positions in the sequence x:
A_kl = Σ_i f_k(i) a_kl e_l(x_(i+1)) b_l(i + 1) / P(x).    (9.11)
In a similar way, since the posterior probabilities P(π_i = k | x) can be computed from the forward-backward algorithm as f_k(i) b_k(i) / P(x) (see Eq. (9.6)), the average number of times state k emits the symbol b will be
E_k(b) = Σ_{i : x_i = b} f_k(i) b_k(i) / P(x).    (9.12)
Next, the expected counts from Eqs. (9.11) and (9.12) are substituted into Eq. (9.8), generating improved estimates for the transition and emission probabilities. This is the set of parameters that we denote by θ¹. Repeating the computations given by Eqs. (9.11) and (9.12) with the new values from the parameter set θ¹ and substituting those values into Eq. (9.8) generates the set of parameters θ², and so on. The process is guaranteed not to decrease the likelihood of the data, that is, P(x | θ⁰) ≤ P(x | θ¹) ≤ P(x | θ²) ≤ … for the sequence of parameter approximations [33]. The process terminates when two successive values are either identical or sufficiently close.
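One full Baum-Welch iteration, combining the forward-backward passes with Eqs. (9.8), (9.11), and (9.12), can be sketched as follows. This is a plain-probability version for readability, in a dictionary-based parameter format of our own choosing with a fixed initial distribution; a production version would work in log space or with scaling.

```python
def baum_welch_step(obs, states, a, e, init):
    """One Baum-Welch iteration: returns re-estimated (a, e) and P(x | old params)."""
    n = len(obs)
    # forward and backward passes under the current parameters
    f = [{k: init[k] * e[k][obs[0]] for k in states}]
    for x in obs[1:]:
        f.append({l: e[l][x] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    px = sum(f[-1].values())
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][obs[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    # expected transition counts, Eq. (9.11)
    A = {k: {l: sum(f[i][k] * a[k][l] * e[l][obs[i + 1]] * b[i + 1][l]
                    for i in range(n - 1)) / px
             for l in states} for k in states}
    # expected emission counts, Eq. (9.12)
    E = {k: {} for k in states}
    for i, x in enumerate(obs):
        for k in states:
            E[k][x] = E[k].get(x, 0.0) + f[i][k] * b[i][k] / px
    # re-estimate the parameters by Eq. (9.8)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {x: E[k][x] / sum(E[k].values()) for x in E[k]} for k in states}
    return new_a, new_e, px
```

Iterating the step (the returned px is the likelihood under the parameters passed in) should never decrease P(x), matching the monotonicity guarantee quoted above.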
As with the other algorithms we have considered in this chapter, to avoid underflows from multiplying small probabilities, the computations are usually carried out in logarithm space. If several sequences are used for training, Eqs. (9.11) and (9.12) should be modified to include the sum of the expected counts from all sequences:
A_kl = Σ_j Σ_i f_k^(j)(i) a_kl e_l(x^(j)_(i+1)) b_l^(j)(i + 1) / P(x^(j)),
E_k(b) = Σ_j Σ_{i : x^(j)_i = b} f_k^(j)(i) b_k^(j)(i) / P(x^(j)),    (9.13)
where the superscript (j) indicates that the respective probability is computed for the sequence x^(j).
The decoding methods described in this section are purely mathematical and may not produce completely accurate results when applied to CGIs. Neither Viterbi nor posterior decoding imposes restrictions on the length of the identified islands or checks whether biologically important conditions, such as high GC content or a high observed-to-expected CpG ratio (see Section 2), are met. When HMMs are used for CGI identification, the consideration of these properties is done during a post-processing stage. At this stage we turn back to the genomic properties of the CGIs that have not been modeled by the HMM. This stage usually includes performing one or more of the following refinements:
– Combine CGIs separated by short gaps: Neighboring CGIs that are separated by small gaps of non-island regions are merged into a single larger island. A minimal distance threshold between islands is set in advance, and neighboring islands closer than this threshold are merged. The threshold values used in the reported literature vary from about 15–20 bp [27] up to 100 bp [8].
– Check for minimal GC content and observed-to-expected CpG ratio: Check whether the islands identified by the decoding methods meet the biologically relevant thresholds for GC content and observed-to-expected CpG ratio as described in Section 2. If an identified CGI does not meet those threshold requirements, its states are relabeled as non-island.
– Check for minimal length: As discussed in Section 2, short sequences labeled as CGIs are not of biological interest. Different length thresholds are used in the literature, but these are usually in the range 140–500 bp [8,27]. If the length of a predicted CpG island is less than the threshold value, its states are relabeled as non-island.
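The three refinements above can be combined into a small post-processing filter. This is a sketch only: the half-open interval convention, the threshold defaults, and the observed/expected CpG formula (CpG count × n / (C count × G count)) are our assumptions, not code from the chapter, and all cutoffs should be treated as tunable.

```python
def postprocess(islands, seq, min_gap=20, min_len=140, min_gc=0.5, min_oe=0.6):
    """Merge predicted islands separated by short gaps, then drop islands that
    fail the length, GC content, and observed/expected CpG ratio cutoffs.
    islands: list of (start, end) half-open intervals into the DNA string seq."""
    merged = []
    for s, t in sorted(islands):
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], t))  # merge neighbors
        else:
            merged.append((s, t))
    kept = []
    for s, t in merged:
        sub = seq[s:t]
        n = len(sub)
        c, g = sub.count("C"), sub.count("G")
        gc = (c + g) / n
        # observed/expected CpG ratio: (#CG dinucleotides) * n / (#C * #G)
        oe = sub.count("CG") * n / (c * g) if c and g else 0.0
        if n >= min_len and gc >= min_gc and oe >= min_oe:
            kept.append((s, t))
    return kept
```

For instance, two predicted islands 10 bp apart are merged and, if the merged stretch is long and CpG-rich enough, reported as a single island, while short or AT-rich predictions are discarded.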
Post-processing is then applied to filter out the regions that do not meet the biological criteria for CGIs, with cutoffs as follows: If the distance between neighboring islands is less than 20 bp, the regions are merged. The newly extended CGIs and all other predicted islands are then tested to be at least 140 bp long and to meet the GC content and observed-to-expected CpG ratio cutoffs. If those combined conditions are not met, the states previously recognized as CGIs are relabeled as non-islands. Table 1 in [27] gives the final results for several test sequences used by the authors and compares them with results obtained from other CGI locating systems.
CGIs are of great interest in genomic analysis and are often used as markers for cancer and gene identification, as well as to investigate methylation profiles [34,35]. Following the original definition and algorithm for CGI identification by Gardiner-Garden and Frommer [13], a number of sliding window algorithms have been developed, implementing different interpretations and cutoff values for GC content, observed-to-expected CpG ratio, island length, gaps between islands, and size of the sliding window [8,14–16]. It has been shown, however, that such methods do not provide an exhaustive search and may miss a large percentage of CGIs [17]. In addition, since sliding windows do not make use of any underlying mathematical structure, alternative approaches based on mathematical models are preferable in order to make some of the problems arising in the context of CGI identification more standardized and tractable.
In this chapter we focused mainly on the use of HMMs for CGI identification. Viewing DNA data as output from an HMM elucidates the search for CGIs by placing the problem within a well-developed mathematical framework where efficient methods for decoding, evaluation, and training are readily available. In general, HMMs are widely used to analyze sequences and time series of data under the assumption that they have been generated by a process that cannot be directly observed. Instead, information about the process must be inferred from the observed sequences. In addition to CGI identification, this broad-spectrum approach has also been used for speech recognition [36], protein modeling [37], peptide sequencing [38], multiple DNA sequence alignment [39], gene prediction [40], and many other applications.
The models introduced in this chapter can also be used as a basis for further extensions and improved HMMs for CGI identification. Some work in this direction includes the development of an extensible approach that summarizes the evidence of CpG islands as probability scores [41], a hybrid visualization HMM method [18], a modified HMM approach with Poisson emission probabilities [42], HMM approaches to improving the power of pattern detection [43], improved performance Viterbi and EM algorithms [44], and methods for speeding up HMM decoding [45].
The chapter includes a number of exercises of both an applied and a theoretical nature and an online project, Investigating “Predicted” Genes, available from the volume’s website. We encourage the reader to attempt each exercise/project as soon as the relevant material for its execution has been introduced. Most of the applied exercises use the CpG Educate suite of web applications developed for this chapter and available at http://inspired.jsu.edu/~agarrett/cpg/ [30]. CpG Educate uses the General Hidden Markov Model library (http://ghmm.org/) for the implementation of all HMM algorithms. A key feature of CpG Educate is that it provides the option to simulate data for most of the HMMs used in the chapter and to compare the results from decoding and training with those from the simulations. After completing the chapter exercises, we invite the reader to undertake further experimentation with the decoding and learning methods on various simulated and actual data.
Authors Davies, Kirkwood, and Robeva gratefully acknowledge the support of the National Science Foundation under DUE award #0737467.
A supplementary project and additional files and data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/B978-0-12-415780-4.00019-3 and from the volume’s website http://booksite.elsevier.com/9780124157804.
1. Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16:6–21.
2. Klose RJ, Bird AP. Genomic DNA methylation: the mark and its mediators. Trends Biochem Sci. 2006;31:89–97.
3. Sorensen AL, Timoskainen S, West FD, et al. Lineage-specific promoter DNA methylation patterns segregate adult progenitor cell types. Stem Cells Dev. 2010;19:1257–1266.
4. Isagawa T, Nagae G, Shiraki N, et al. DNA methylation profiling of embryonic stem cell differentiation into the three germ layers. PLoS One. 2011;6:e26052.
5. Collas P. Programming differentiation potential in mesenchymal stem cells. Epigenetics. 2010;5:476–482.
6. Neddermann P, Jiricny J. The purification of a mismatch-specific thymine-DNA glycosylase from HeLa cells. Journal Biol Chem. 1993;268:21218–21224.
7. Straussman R, Nejman D, Roberts D, et al. Developmental programming of CpG island methylation profiles in the human genome. Nature Struct Mol Biol. 2009;16:564–571.
8. Takai D, Jones PA. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci USA. 2002;99:3740–3745.
9. Ashley DJB. The two hit and multiple hit theories of carcinogenesis. Br J Cancer. 1969;23:313–328.
10. Renan MJ. How many mutations are required for tumorigenesis? Implications from human cancer data. Mol Carcinog. 1993;7:139–146.
11. Schappert-Kimmijser J, Hemmes JGD, Nijland R. The heredity of retinoblastoma. Ophthalmologica. 1966;151:197–213.
12. Nobori T, Miura K, Wu DJ, Lois A, Takabayashi K, Carson D. Deletions of the cyclin-dependent kinase-4 inhibitor gene in multiple human cancers. Nature. 1994;368:753–756.
13. Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987;196:261–282.
14. Rice P, Longden I, Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. TIG. 2000;16:276–277.
15. Ponger L, Mouchiroud D. CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics. 2002;18:631–633.
16. Wang Y, Leung FCC. An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics. 2004;20:1170–1177.
17. Hsieh F, Chen SC, Pollard K. A nearly exhaustive search for CpG islands on whole chromosomes. Int J Biostatistics. 2009;5 Art. 14.
18. Rambally G, Rambally R. A hybrid visualization Hidden Markov Model approach to identifying CG-islands in DNA sequences. In: Proceedings of IEEE SoutheastCon 2008; April 3–6, 2008. p. 1–6.
19. Hackenberg M, Previti C, Luque-Escamilla P, Carpena P, Martinez-Aroza J, Oliver J. CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinform. 2006;7:446.
20. Hackenberg M, Barturen G, Carpena P, Luque-Escamilla PL, Previti C, Oliver JL. Prediction of CpG-island function: CpG clustering vs sliding-window methods. BMC Genom. 2010;26:327.
21. Sujuan Y, Asaithambi A, Liu Y. CpGIF: an algorithm for the identification of CpG islands. Bioinformation. 2008;2:335–338.
22. Chuang LY, Huang HC, Lin MC, Yang CH. Particle swarm optimization with reinforcement learning for the prediction of CpG islands in the human genome. PLoS One. 2011;6:e21036.
23. Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge, UK: Cambridge University Press; 1998.
24. Pachter L, Sturmfels B. Algebraic statistics for computational biology. Cambridge, UK: Cambridge University Press; 2005.
25. Norris JR. Markov chains. Cambridge: Cambridge University Press; 1997.
26. Elliott RJ, Aggoun L, Moore JB. Hidden Markov models: estimation and control (corrected 3rd printing). New York: Springer; 2008.
27. Lan M, Xu Y, Li L, Wang F, Zuo Y, Tan CL, et al. CpG-Discover: a machine learning approach for CpG island identification from human DNA sequence. In: Proceedings of international joint conference on neural networks, Atlanta, Georgia, USA; June 14–19, 2009. p. 1702–07.
28. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inform Theory. 1967;IT-13:260–269.
29. Viterbi A. A personal history of the viterbi algorithm. IEEE Signal Process Mag. 2006;23:120–142.
30. Garrett A. CpG Educate software tutorial; 2012. http://inspired.jsu.edu/agarrett/cpg/CpGEducate.pdf.
31. Baum LE, Petrie T, Soules G, Weiss N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat. 1970;41:164–171.
32. Welch LR. The Shannon lecture: hidden Markov models and the Baum-Welch algorithm. IEEE Inform Soc Newslett. December 2003;53.
33. McLachlan G, Krishnan T. The EM algorithm and extensions. 2nd ed. Hoboken, NJ: Wiley; 2008.
34. Illingworth RS, Bird AP. CpG islands – a rough guide. FEBS Lett. 2009;583:1713–1720.
35. Bobbie PO, Reams R, Suther S, Brown CP. Finding molecular signature of prostate cancer: an algorithmic approach. In: Proceedings of the 2006 international conference on bioinformatics & computational biology, BIOCOMP’06, Las Vegas, Nevada, USA; June 26–29, 2006. p. 265–9.
36. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–285.
37. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol. 1994;235:1501–1531.
38. Fischer B, Roth V, Roos F, et al. NovoHMM: a hidden Markov model for de novo peptide sequencing. Anal Chem. 2005;77:7265–7273.
39. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–340.
40. Bernal A, Crammer K, Hatzigeorgiou A, Pereira F. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput Biol. 2007;16:e54.
41. Wu H, Caffo B, Jaffee HA, Irizarry RA, Feinberg AP. Redefining CpG islands using hidden Markov models. Biostatistics. 2010;11:499–514.
42. Irizarry RA, Wu H, Feinberg AP. A species-generalized probabilistic model-based definition of CpG islands. Mamm Genome. 2009;20:674–680.
43. Zhai Z, Ku SY, Luan Y, Reinert G, Waterman MS, Sun F. The power of detecting enriched patterns: an HMM approach. J Comput Biol. 2010;17:581–592.
44. Lam T, Mayer A. Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation maximization (EM) training and Viterbi training. Alg Mol Biol. 2010;5:38.
45. Lifshits Y, Mozes S, Weimann O, Ziv-Ukelson M. Speeding up HMM decoding and training by exploiting sequence repetitions. Algorithmica. 2009;54:379–399.
1Those include red blood cells, which have no DNA, the lymphocytes of the immune system, in which DNA has been rearranged, and gametes, which have half of the adult’s DNA.
2Alu sequences (named for the restriction endonuclease AluI, which cuts in these sequences) are short repetitive sequences with a relatively high GC content and CpG ratio.
3This condition simply reflects the fact that the process will be in some state from Q at the next time step, unless it terminates.
4We will show how this can be done in a computationally efficient way in Section 4.
5For example, within the “+” and the “−” regions, the transition probabilities could be similar to those in Table 9.1 (with the modification that there should also be small nonzero probabilities of switching between the “+” and the “−” regions).
6The notation argmax is for the argument of the maximum. We need the path(s) π for which P(x, π) is maximal.
7For an observed sequence of just 90 nucleotides, the number of paths already exceeds the estimated number of atoms in the observable universe. The brute force approach would therefore require years of computing time even if large computing resources are applied. In practice, we would be interested in sequences of tens of thousands of nucleotides.
8The “big O” notation here means that for very large values of q and L the number of operations required by the Viterbi algorithm has the same order of magnitude as q²L.
9Exercises marked with an asterisk indicate that their execution requires downloads from the volume’s website.
10For optimal viewing of the posterior probabilities, use sequences with lengths between 300 and 1200.
11The likelihood of the data, given the set of model parameters θ, is computed by the forward algorithm.
12The CpG Islands application in the CpG Educate suite uses these values as default parameters for the CGI HMM.