CHAPTER 6
Nucleotide Sequence Analysis Using Sequence Manipulation Suite (SMS)

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

6.1 INTRODUCTION

Nucleotide and amino acid sequences are analyzed to understand their hidden features, to discover patterns, and to determine their function, structure and evolution. Frequently used in silico analyses of molecular sequences include: sequence alignment; determination of conserved regions; identification of low‐complexity regions of nucleotides; gene prediction; nucleotide sequence assembly; exploration of the biochemical and immunogenic properties of amino acid sequences; protein structure prediction; and so on. After completing this chapter, you will know how to use some of these sequence analysis techniques with a free online software package called Sequence Manipulation Suite. The original examples cited in the software suite (as “help” for explaining the programs) have been used here. In some places, the explanations may be verbatim.

6.1.1 Sequence Manipulation Suite (SMS)

This is a collection (which is why it is called a “suite”) of programs, written in JavaScript 1.5, for generating, formatting, and analyzing short DNA and protein sequences. Paul Stothard (of the University of Alberta, Canada) wrote the software suite (Stothard, 2000). The off‐line suite can be downloaded from the link http://www.bioinformatics.org/sms2/mirror.html/.

Sequences are submitted to the sequence box of SMS and then analyzed according to the particular query.

6.2 OBJECTIVE

To learn the use of the different programs within SMS for analyzing nucleotide and amino acid sequences.

6.3 PROCEDURE

  1. Open the home page of Sequence Manipulation Suite using the URL: http://lion.img.cas.cz/sms2/index.html/. Check its compatibility with your web browser.
  2. Various analytical tools, categorized into five groups, are available in the panel on the left‐hand side of the page:
    1. Format Conversion
    2. Sequence Analysis
    3. Sequence Figures
    4. Random Sequences
    5. Miscellaneous
  3. Click on the required tool in the left‐hand side panel: for example, if the first tool “Combine FASTA” is clicked, the specific page is opened.
  4. The page for each tool explains how to use it, along with an example.
  5. The common steps are to paste sequence(s) in the sequence box, and then click “Submit”. The result is returned in a separate window. The “Clear” button will erase all data in the sequence box. The last button “Reset” will delete any document pasted in the sequence box and will reset the parameter(s).

6.4 FORMAT CONVERSION

6.4.1 Combine FASTA

This converts multiple FASTA sequences (either nucleotide or amino acid records) into a single sequence (Figure 6.1). The software imposes a restriction of input to 500 000 characters in total (inclusive of description line and input sequences).

FIGURE 6.1 “Combine FASTA” input page to provide input data, and the corresponding output page with the result.
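
The idea behind “Combine FASTA” can be illustrated outside the browser with a few lines of Python. This is only a minimal sketch, not the code used by SMS: the function name `combine_fasta` and the title given to the combined record are assumptions made for illustration.

```python
def combine_fasta(text, title="combined sequence"):
    """Join the residues of every FASTA record in `text` into one record."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and description lines
        residues.append(line)
    return f">{title}\n{''.join(residues)}"


example = ">seq1\nATGC\nGGA\n>seq2\nTTTA\n"
print(combine_fasta(example))
```

The description lines of the individual records are discarded, so only the concatenated residues remain under a single new title.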

6.4.2 EMBL feature extractor

This program extracts the salient features (according to the annotations) of one or more EMBL file(s), and returns the sequences in FASTA format in a new window. The program thus returns the whole nucleotide sequence, the mRNA and the cDNA sequence as separate FASTA format files as output. This is useful if the user wants to extract only the cds (coding sequence) or mRNA sequence out of the whole gene sequence (containing exons and introns, as well).

This program has a limit of 200 000 characters as input. There are two options for the output sequence features:

  • “separated” – only the specific portions, viz. mRNA or cDNA parts out of the whole sequence, will be shown in lower case; or
  • “UPPER‐case” – the specified stretches of the mRNA or cDNA will be highlighted in upper case, while the rest of the source sequence will be shown in lower case.

6.4.3 EMBL Trans Extractor

This program accepts one or more EMBL files and extracts the translated amino acid sequence in the result window (Figure 6.2). This program has a limit of 200 000 characters of input.

FIGURE 6.2 “EMBL Trans Extractor” input page, and the corresponding output page with extracted results.

6.4.4 Filter DNA

The input DNA sequence is filtered by eliminating the non‐DNA characters (digits and blank spaces) from the whole sequence (input limit is 500 000) (Figure 6.3).

FIGURE 6.3 “Filter DNA” input page, along with various options as control parameters.

There are some options to modify filtration:

  • What to replace: “Characters” and/or “white spaces”.
  • Replace with what: n/N/t/T/u/U/*/‐/?
  • Case conversion of characters.
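
The filtering step can be approximated with a regular expression. The sketch below assumes that the characters to keep are “gatcn” in either case (as suggested by the option shown in Figure 6.3); the function name `filter_dna` and its parameters are hypothetical and do not reproduce the SMS interface.

```python
import re


def filter_dna(text, replacement="", keep="gatcn"):
    """Replace every character not in `keep` (case-insensitive) with `replacement`."""
    pattern = re.compile(f"[^{re.escape(keep)}]", re.IGNORECASE)
    return pattern.sub(replacement, text)


raw = "1 gatc ATGN\n11 xxgg"
print(filter_dna(raw))                      # digits, spaces and other letters removed
print(filter_dna("ga tc", replacement="n"))  # removed characters replaced with 'n'
```

Case conversion is left out of the sketch; calling `.upper()` or `.lower()` on the result would cover that option.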

6.4.5 Filter Protein

This is similar to the “Filter DNA” program. It filters out non‐amino acid characters (digits, blank spaces, special characters) from an amino acid sequence. Some options are available on what to replace, replace with what and case conversion. The character limit of input is 500 000.

6.4.6 GenBank Feature Extractor

Similar to the “EMBL Feature Extractor” program. The input is nucleotide sequences in GenBank format.

6.4.7 GenBank Trans Extractor

Similar to the “EMBL Trans Extractor”. The input is the nucleotide sequence in GenBank format.

6.4.8 One to Three

This program converts single‐letter amino acid codes into three‐letter amino acid codes. One or more amino acid sequences in FASTA format (one‐letter code) are pasted into the sequence box. The input limit is 100 000 characters.
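
The conversion itself is a simple table lookup. The following sketch covers the 20 standard amino acids; the function name `one_to_three` and the use of “Xaa” for any unrecognized symbol are assumptions for illustration, and the way SMS lays out its output may differ.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def one_to_three(seq, unknown="Xaa"):
    """Convert a one-letter protein sequence into concatenated three-letter codes."""
    return "".join(ONE_TO_THREE.get(residue, unknown) for residue in seq.upper())


print(one_to_three("MKV"))
```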

6.4.9 Range Extractor DNA

This returns the specific nucleotide sequence, based on the position and/or range(s) of nucleotide(s)/nucleotide sequence(s) specified in the input. The user needs to paste the DNA sequence in the sequence box. The specific position(s) (given by the position value(s) of the base(s)) and/or the ranges (two position values for the termini, separated by “…”), separated by comma(s), are then given. There are some options to output the results in FASTA format (either in upper or lower case) in one sequence, or in multiple sets of sequences (for multiple positions/ranges). The range(s) can be specified either in the original strand (“direct strand” option) or in the complementary strand to the input sequence (“complementary strand” option). The input limit is 500 000 characters.
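
A minimal sketch of the range extraction on the direct strand is given below. It assumes that positions are 1‐based and inclusive, and that the two ends of a range are written with two dots (for example “4..7”); the function name `extract_ranges` is hypothetical, and the “complementary strand” option is not covered.

```python
def extract_ranges(seq, spec):
    """Return the pieces of `seq` named by a spec such as '2, 4..7' (1-based, inclusive)."""
    pieces = []
    for item in spec.split(","):
        item = item.strip()
        if ".." in item:
            start, end = (int(x) for x in item.split(".."))
        else:
            start = end = int(item)  # a single position
        pieces.append(seq[start - 1:end])
    return pieces


print(extract_ranges("ACGTACGT", "2, 4..7"))
```

Joining the returned pieces with `"".join(...)` would correspond to the option of reporting everything as one sequence.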

6.4.10 Range Extractor Protein

This program is similar to “Range Extractor DNA”, except that the input sequence is amino acid (in FASTA format) (Figure 6.4). Obviously, the drop‐down options for “direct strand” and “complementary strand” are not there for amino acids in this program.

FIGURE 6.4 “Range Extractor Protein” input page and the corresponding output page with extracted sequences.

6.4.11 Reverse Complement

This program returns the reverse complement, the reverse, or only the complement of the given nucleotide sequence(s). It can work with single or multiple DNA sequence(s) as input, and it supports the full IUPAC DNA alphabet (Figure 6.5). The input limit is 100 000 characters.

FIGURE 6.5 “Reverse Complement” input page and the corresponding output pages for “Complement”, “Reverse” and “Reverse Complement”, respectively (from left to right), of the input sequences.
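
The three operations can be reproduced with a translation table that pairs each IUPAC DNA symbol with its complement. This is an illustrative sketch rather than the SMS implementation; the function names are assumptions.

```python
# Each IUPAC symbol in the first string is paired with its complement in the second.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def complement(seq):
    """Complement each base, keeping the original order."""
    return seq.translate(_COMPLEMENT)


def reverse(seq):
    """Reverse the order of the bases."""
    return seq[::-1]


def reverse_complement(seq):
    """Complement the sequence and read it in the opposite direction."""
    return reverse(complement(seq))


print(complement("AGATN"), reverse("AGATN"), reverse_complement("AGATN"))
```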

6.5 SEQUENCE ANALYSIS

6.5.1 Codon Usage

This accepts single/multiple DNA sequence(s) in FASTA format, and estimates the number and frequency of usage of each of the 64 codons available in the specific genome (eukaryotic/prokaryotic, nuclear/mitochondrial, etc.). The output presents the frequency of occurrence of each codon in the given input sequence(s). The preference of a given sequence for a specific synonymous codon can be determined with this program. The input limit is 500 000 characters.
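
Counting codons is easy to sketch in Python. The example below assumes that the sequence is read in frame 1 from its first base and reports, for each codon seen, its count and its fraction of all codons; the function name `codon_usage` is hypothetical, and SMS may report the figures in a different form.

```python
from collections import Counter


def codon_usage(seq):
    """Return {codon: (count, fraction of all codons)} for frame 1 of `seq`."""
    seq = seq.upper()
    # Step through the sequence three bases at a time, ignoring a trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: (n, n / total) for codon, n in counts.items()}


usage = codon_usage("ATGGCCGCCTAA")
for codon, (n, fraction) in sorted(usage.items()):
    print(codon, n, fraction)
```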

6.5.2 CpG Islands

This program calculates the Observed/Expected (Obs/Exp) value for CpG dinucleotides, together with the GC content, in 200 bp windows of a given DNA sequence (Gardiner‐Garden and Frommer, 1987). CpG islands are like islets within a DNA sequence: windows characterized by an Obs/Exp ratio of cytosine‐phosphate‐guanine (CpG) dinucleotides greater than 0.6 and a GC content greater than 50%.

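\[ \mathrm{Obs/Exp} = \frac{\text{number of CpG}}{(\text{number of C} \times \text{number of G}) / N} \tag{6.1} \]

where N is the length of the window in bases.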

This program can also be used for identifying the 5′ regions of vertebrate genes, since these regions are often rich in CpG dinucleotides. The maximum input limit is 100 000 characters.
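
The window test can be sketched as follows, taking the Obs/Exp ratio as the observed number of CpG dinucleotides divided by (number of C × number of G / window length). The function name `cpg_windows` and the step of 1 bp between successive windows are assumptions for illustration; this is not the SMS source code.

```python
def cpg_windows(seq, window=200, step=1):
    """Return (start, end, Obs/Exp, GC fraction) for windows passing both thresholds."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        c, g = win.count("C"), win.count("G")
        expected = c * g / window
        if expected == 0:
            continue  # no C or no G in this window, so the ratio is undefined
        obs_exp = win.count("CG") / expected
        gc = (c + g) / window
        if obs_exp > 0.6 and gc > 0.5:
            # report 1-based, inclusive coordinates
            hits.append((start + 1, start + window, round(obs_exp, 2), round(gc, 2)))
    return hits


print(cpg_windows("CG" * 100))  # one window made entirely of CpG
print(cpg_windows("AT" * 100))  # no C or G at all
```

Overlapping windows that pass the test would normally be merged into one reported island; that step is omitted here to keep the sketch short.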

6.5.3 DNA Molecular Weight

This calculates the molecular weight of double‐ or single‐stranded, linear or circular DNA sequence(s) in FASTA format (drop‐down options are provided to select the type of DNA molecule). Standard IUPAC base symbols are accepted. The character limit is 200 000. The calculated molecular weight is useful for working out molecule copy number.

6.5.4 DNA Pattern Find

This program scans one or more submitted DNA sequence(s) for a pattern specified by the user. The default pattern is “ctt[ca]”, which searches for occurrences of “cttc” and “ctta”; the user can modify it. The output lists the base positions (start and end) of each match, along with the number of times the pattern has been found in the direct (original) or reverse strand. “DNA Pattern Find” is thus used to screen the input (a raw sequence or FASTA‐formatted sequence(s)) and localize the pattern of interest. The input limit is 500 000 characters.
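
A pattern such as “ctt[ca]” is a regular expression, so the search can be sketched with Python's `re` module. The function name `find_pattern` is hypothetical, and reporting reverse‐strand matches in direct‐strand coordinates is an assumption about a convenient convention, not a description of the SMS output.

```python
import re

_COMP = str.maketrans("ACGTacgt", "TGCAtgca")


def find_pattern(seq, pattern="ctt[ca]"):
    """Return (strand, start, end, match) tuples; positions are 1-based on the direct strand."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = [("direct", m.start() + 1, m.end(), m.group()) for m in regex.finditer(seq)]
    reverse_strand = seq.translate(_COMP)[::-1]
    n = len(seq)
    for m in regex.finditer(reverse_strand):
        # convert reverse-strand coordinates back to direct-strand positions
        hits.append(("reverse", n - m.end() + 1, n - m.start(), m.group()))
    return hits


print(find_pattern("AACTTCGG"))
print(find_pattern("GAAGTT"))
```

Note that `finditer` reports non‐overlapping matches only, which is sufficient for short motifs like the default one.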

6.5.5 DNA Stats

This is a very useful program for obtaining the number, as well as the percentage, of each base in the input sequence(s), and of groups of bases (i.e., pyrimidines, purines, A/T, etc.). The sequence(s) are submitted as a raw sequence, or as one or more FASTA‐format records. The input limit is 500 000 characters.
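
The kind of summary produced can be approximated in a few lines. The grouping labels and the function name `dna_stats` below are chosen for illustration only; the categories reported by SMS may be more extensive.

```python
from collections import Counter


def dna_stats(seq):
    """Return {label: (count, percentage)} for single bases and simple base groups."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    groups = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "purine (A+G)": "AG", "pyrimidine (C+T)": "CT",
        "G+C": "GC", "A+T": "AT",
    }
    stats = {}
    for label, bases in groups.items():
        n = sum(counts[b] for b in bases)
        stats[label] = (n, 100 * n / total if total else 0.0)
    return stats


for label, (n, percent) in dna_stats("AACG").items():
    print(f"{label}: {n} ({percent:.1f}%)")
```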

6.5.6 Mutate for Digest

This program is used to explore mutable regions in a DNA sequence (provided in FASTA format) to generate a restriction site to study the effect of mutation on restriction digestion. The output file also displays the translation of the DNA (according to the reading frame indicated by the user), to determine the alterations in various reading frames (RFs) due to the proposed mutation. Thus, experiments involving polymerase chain reactions (PCR) or site‐directed mutagenesis can be studied in silico using this program. Four parameters can be set for alteration of output:

  1. Search for future __ <Restriction Enzyme Name> __ sites: Almost all commercially available restriction enzymes (REs) are listed. The user needs to choose one RE, according to the requirement or proposal.
  2. Show _ <Number> _ of bases per line: Choose any one of 30, 45, 60, 75, 90 and 105.
  3. Show the translation for reading frame: RFs can be 1, 2, 3, all 3, upper case or none.
  4. Use the __ <Genetic Code Options> _ genetic code: The options are prokaryotic, eukaryotic, nuclear, or mitochondrial codes.

The input limit is 10 000 characters.

6.5.7 Protein Isoelectric Point

The theoretical isoelectric point (pI) is calculated for single or multiple amino acid sequence(s) (in FASTA format, with an input limit of 200 000 characters), to estimate the probable location of a protein on a 2D gel. The user can add up to five copies of one of the 21 optional epitopes and fusion protein tags listed (e.g., His6, HSV, Glu‐Glu, etc.) to the submitted amino acid sequences, which alters the calculated pI (Figure 6.6).

FIGURE 6.6 “Protein Isoelectric Point” input page and the corresponding output page with results, with respect to the parameters.

6.5.8 Protein Molecular Weight

This calculates the molecular weight of one or more protein sequence(s), entered in FASTA format or as a raw (unformatted) sequence (character limit is 200 000). The user can append 1–5 copies of one of the 21 listed epitopes and fusion proteins. This program is used to predict the position of a recombinant or simple protein on a gel, relative to a set of protein standards.

6.5.9 Protein Pattern Find

Similar to “DNA Pattern Find”, this program is used to search a query (i.e., any consensus amino acid sequence) within one or more input sequence(s) (entered in FASTA format, and with character limit 500 000). The default search pattern is “X[^X]{0,5}X”, which means that the user wants to search for the occurrence of two residues of the amino acid “X” which may be spanned by 0–5 amino acids (other than X) in between.

6.5.10 Protein Stats

Similar to the “DNA Stats” program, this is used to obtain data such as the number of occurrences of each residue in one or more input sequence(s) (in FASTA format or raw sequence; input limit 500 000 characters).

6.5.11 Restriction Summary

This returns the positions of the restriction sites of all the listed, commonly used REs in one or more linear or circular DNA sequences in FASTA format (100 000 base limit). This program is very useful for scanning a DNA sequence for possible RE sites.
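
Scanning for restriction sites amounts to searching for each recognition sequence. The sketch below uses only three well‐known enzymes (EcoRI, BamHI and HindIII) with palindromic sites, treats the sequence as linear, and reports the 1‐based start of each recognition site rather than the cut position; the function name `restriction_summary` and these simplifications are assumptions for illustration.

```python
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def restriction_summary(seq, sites=SITES):
    """Return {enzyme: [1-based start positions of its recognition site]} for a linear sequence."""
    seq = seq.upper()
    summary = {}
    for enzyme, site in sites.items():
        positions = []
        i = seq.find(site)
        while i != -1:
            positions.append(i + 1)
            i = seq.find(site, i + 1)  # continue searching after this hit
        summary[enzyme] = positions
    return summary


print(restriction_summary("AAGAATTCTTGGATCCGAATTC"))
```

Because these three sites are palindromic, searching one strand is enough; non‐palindromic sites would also need a search of the reverse complement.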

6.5.12 Reverse Translate

This returns the reverse translated nucleotide sequence(s), along with a consensus sequence for each amino acid, from one or more input amino acid sequence(s) (in FASTA format with a limit of 20 000 characters), based on the codon usage table entered by the user (selected from http://www.kazusa.or.jp/codon/). This program is used to design oligos that target a (not yet sequenced) coding region belonging to a related species.

6.6 SEQUENCE FIGURES

6.6.1 Restriction Map

This displays a “textual map” for the RE sites in the template DNA (FASTA format input; input limit is 100 000 characters) which can be exploited for exploring the RE sites for cloning a sequence. It also returns the in silico translated amino acid sequence, according to the user‐defined reading frame.

6.6.2 Translation Map

This depicts a textual map displaying in silico translations of the input DNA sequence (in FASTA format; input limit is 500 000 characters), according to the first, second, third, or all three reading frames (RFs). This program understands IUPAC codes and supports different genetic codes.
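
Translation in a chosen reading frame can be sketched with the standard genetic code. The example below builds the codon table from the conventional TCAG ordering, handles only the standard code, and writes “X” for any codon containing an ambiguous base; the function name `translate` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, third base varying fastest.
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=1):
    """Translate `seq` in reading frame 1, 2 or 3; '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    start = frame - 1
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(start, len(seq) - 2, 3)
    )


for frame in (1, 2, 3):
    print(frame, translate("ATGGCCTAA", frame))
```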

6.7 RANDOM SEQUENCES

6.7.1 Mutate DNA

This introduces random mutation(s) into a coding sequence (presented in FASTA format as the input sequence; input limit is 100 000 characters), so that the effect of spontaneous mutation on the nature of the encoded peptide can be assessed. The user can specify the number of mutations and whether mutations may occur in the start and stop codons of the mRNA.
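
Random point mutation can be sketched as shown below. The function name `mutate_dna`, the `protect_ends` flag (which leaves the first and last three bases untouched) and the `seed` parameter are assumptions for illustration; as in the description above, the same position may be hit more than once.

```python
import random


def mutate_dna(seq, n_mutations=1, protect_ends=True, seed=None):
    """Apply `n_mutations` random base substitutions to `seq`."""
    rng = random.Random(seed)
    bases = list(seq.upper())
    # Optionally protect the first and last codon (start and stop).
    lo, hi = (3, len(bases) - 3) if protect_ends else (0, len(bases))
    if hi <= lo:
        return "".join(bases)  # nothing that may be mutated
    for _ in range(n_mutations):
        pos = rng.randrange(lo, hi)
        # choose a base different from the one currently at this position
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


original = "ATGAAACCCGGGTAA"
print(original)
print(mutate_dna(original, n_mutations=3, seed=1))
```

Passing a fixed `seed` makes a run repeatable, which is convenient when comparing the translated peptides before and after mutation.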

6.7.2 Mutate Protein

Similar to the “Mutate DNA” program, this introduces random mutations into an amino acid sequence. As in the “Mutate DNA” program, multiple mutations can occur at the same amino acid position. This program is used to assess the effect of mutation on the chemical nature of the peptide, and the phenotypic effect on the trait.

6.7.3 Random Coding DNA

This produces a random coding sequence (ORF from start to stop codon), based on the user‐specified genetic code and ORF length. Such ORFs are used to study the evolutionary perspectives and speciation.

6.7.4 Random DNA Sequence

Similar to “Random Coding DNA”, but this generates a random DNA sequence instead of a coding sequence.

6.8 MISCELLANEOUS

  1. IUPAC codes: The International Union of Pure and Applied Chemistry (IUPAC) is the body that provides the standard codes for protein and DNA sequences. The list is given in the previous chapter (Chapter 5: Sequence Format Conversion).
  2. Genetic codes, Browser compatibility, Reference, etc.

6.9 QUESTIONS

  1. Find the reverse complement and reverse sequence of the following sequences:

    > Seq1_GenBank_Acc_No_ AB002707.1

    A G A T A A T A C T T G A G A C G T T C C A G T T T N T A T T A G T A C A A A A T G N C C A A T T C A T T C A A T G A A T T G A G A A A T G A C A T T C T A A G T G A G T T A G G A G C C A C G A C A A T T G T A G A A C A C A C A G T G T T T A A C A A G T A A C C A A T G A G A A T T N N T G A T C T A T C A A T C A G T T G G T A G T A T C G A G G A C T A C C A A G A T T A T A A C G G A A T A A C G A G G A A T T

    > Seq2_GenBank_Acc_No_ KT779508.1

    T G A G T A A A T C A G T T A T A G T T T G T T T G A T G G T A T C T A C T A C T C G G A T A A C C G T A G T A A T T C T A G A G C T A A T A C G T G C A A C A A A C C C C G A C T T C T G G A A G G G A T G C A T T T A T T A G A T A A A A G G T C G A C G C G G G C T C T G C C C G T T G C T G C G A T G A T T C A T G A T A A C T C G A C G G A T C G C A C G G C C A T C G T G C C G G C G A C G C A T C A T T C A A A T T T C T G C C C T A T C A A C T T T C G A T G G T A G G A T A G T G G C C T A C C A T G G T G G T G A C G G G T G A C G G A G A A T T A G G G T T C G A T T C C G G A G A G G G A G C C T G A G A A A C G G C T A C C A C A T C C A A G G A A G G C A G C A G G C G C G C A A A T T A C C C A A T C C T G A C A C G G G G A G G T A G T G A C A A T A A A T A A C A A T A C C G G G C T C A A T G A G T C T G G T A A T T G G A A T G A G T A C A A T C T A A A T C C C T T A A

  2. Determine whether the given sequence has a CpG island:

    > Seq1_GenBank_Acc_No_ XM_014823107.1

    A T G A G A A G C G G C A T C A T A G C G C A G T G C G C T T T C T G T G T A A C T C G C G G C A A C G T C G C T C A G G C A A G C T T T C G A T T T C T G G C C C A G A A C T T C G G C C G C A A G A T C T G T C C G C T A G C T T G G G C A C A C T C G T C G G A T C G G T G C C G C A G C T G C T T C T G G C G C G G C C G G A T A C C A G A C A T A C C A G A G C G T G A T T A C C T G C G T G T G T G G G C G C A A G A G G A T C T C A A C G T C A T C G T C A T C G T C A T G G C A A C C C T T G G C A A G T T T G C C T T A A C G G T T A C G T T C G C C G T C T G C T A C C T G T A C A G C G G T G A G A T C T A C C C G A C T G C C A T C C G G A A T G T C G G A C T T G G A A G C A A T T C G G C T T G T G C G C G G G T C G G A G C G A T G G T G G C G C C A T A T A T C A C C C T G C T G G C C A A G G A C G T G G C G T G G C T G C C C A T G G T A C T G T T C G G C G C G C T G G C A G T G G T T G C T G C T C T G C T G G C A G C C A T G T T G C C A G A G A C G C G A A A T T G C C A T C T G C C A G A G A C G A T C G A A G A C G G A G A G A A T T T C A A C A G

  3. Enumerate the DNA statistics of the nucleotide sequence with NCBI Nucleotide accession number S78771.1.
  4. Open the sequence with NCBI GenBank acc. No. NM_001271282.2, and then extract the features of the sequence.
  5. Provide the restriction summary of the sequence given below:

    > Seq1_GenBank_Acc_No_ XM_012883685.1

    A T G G T A G A G G A C G A G G A C G A A G A C G A A G A T A C G T C T A A C A A C A G C A G C T C A G A T G A C A G C A G C A G C T C C G A T G A C G A T G A C G A T G A C G T C C C A G A C G A T G A C G A G T A T G A T G T T A A G A A A G T T A A G C A C C G A G A G G A G G T G C C G C G C A T T C A G A T A G T T G G A T C A A G G T C G C A A T G G T T G G A A G C A A T C C G C A G A G A C G G C A C G G C A G G T G A G T C A G C T A G G A T G A A G G C A T T C T T A G A G G T A T T T C G C G A A G C C C A A C A C C T T T A T C C T G A C C A G A G A G T T T C T G C T A C C T C C G A G G A G A C G A A G A C C C T T G A T A T C G T C G C C C T T A T T C T A A A G G A T G A A G G G A A A A T C T G T G T G C A A T A T G A T G G C A T A C T T C C G C C C C G C G A T A G G G C A G C A G C G C T A A A G A C A T T C C A G G A T G G G G C T C C A G C T A C C T T T G T C T G A
