CHAPTER 5
Sequence Format Conversion

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

5.1 INTRODUCTION

A computer file format is a defined way of encoding data for storage in a file. Biological sequence formats are a family of such file formats, each designed to make sequence files readable by specific programs.

Note: Biological sequences are generally written in a monospaced font such as Courier New. Because every character has the same width, the sequences line up uniformly from one line of text to the next.

At the base level, the system reads and inter-converts sequence formats as plain ASCII (American Standard Code for Information Interchange) text, in which every character is stored as a numeric code: A–Z are encoded by 65–90 and a–z by 97–122. A sequence format is therefore a prescribed arrangement of characters, symbols, and keywords that specify the sequence, its ID or name, comments, and so on.
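As a small illustration of the character codes mentioned above, the following Python sketch lists the ASCII code behind each letter of a short sequence. The function name is illustrative only and is not part of any sequence tool.

```python
def ascii_codes(sequence):
    """Return (character, ASCII code) pairs for each character of a sequence."""
    return [(ch, ord(ch)) for ch in sequence]


if __name__ == "__main__":
    # Upper-case letters fall in 65-90, lower-case letters in 97-122.
    for ch, code in ascii_codes("ACGTacgt"):
        print(ch, code)
```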

The sequence formats are needed for two purposes:

  1. Different programs recognize different formats, so a sequence often has to be converted from one format to another before a given program can use it.
  2. Presentations of the molecular sequence are sometimes required in a particular format.

Commonly used sequence formats:

 1. IG/Stanford      7. Fitch               13. Plain/Raw
 2. GenBank/GB       8. Pearson/Fasta       14. PIR/CODATA
 3. NBRF             9. Zuker (in-only)     15. MSF
 4. EMBL            10. Olsen (in-only)     16. ASN.1
 5. GCG             11. Phylip3.2           17. PAUP
 6. DNAStrider      12. Phylip              18. Pretty (out-only)

5.2 OBJECTIVE

To convert a given molecular sequence into other sequence formats, such as GenBank (NCBI), EMBL, PIR, etc.

5.3 PROCEDURE

The online program ReadSeq (by Don Gilbert) will be used to convert the sequence formats. ReadSeq accepts the following formats: FASTA, Abstract Syntax Notation (ASN.1), National Biomedical Research Foundation (NBRF), EMBL, Fitch (phylogenetic analysis), GenBank, GCG, DNA Strider, IntelliGenetics (IG), Multiple Sequence Format (MSF), Protein Information Resource (PIR), and eight additional specialised formats.

  1. Open the online ReadSeq sequence conversion tool using the URL: http://www-bimas.cit.nih.gov/molbio/readseq/
  2. Paste a molecular sequence (nucleotide or amino acid) in any supported format into the text box. The software determines the format of the input sequence automatically (Figure 5.1).
  3. Click on the drop-down menu just above the text box (on the left side) and select the desired output format.
  4. Choose any additional formatting options:
    • Case of the output sequence: click on one of the radio buttons “MiXeD case”, “UPPER” or “lower” case.
    • Removal of gaps: click on the check box to remove existing gaps in the input sequence.
  5. Click on the “Submit” button to get the output.
  6. Click on the “reset” button to erase all the input data and start afresh with the default settings.

FIGURE 5.1 Homepage of the ReadSeq biosequence format conversion tool.
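The two formatting options in the procedure (case change and gap removal) can also be sketched offline in a few lines of Python. This is only an illustration of the idea, not ReadSeq's own code; the function name, the option names, and the assumption that “-” and “.” are the gap characters are all hypothetical.

```python
def reformat_sequence(sequence, case="mixed", remove_gaps=False, gap_chars="-."):
    """Apply a case option and optional gap removal to a sequence string.

    case: "mixed" leaves the sequence unchanged, "upper" or "lower" convert it.
    gap_chars: characters treated as gaps (an assumption for this sketch).
    """
    if remove_gaps:
        sequence = "".join(ch for ch in sequence if ch not in gap_chars)
    if case == "upper":
        return sequence.upper()
    if case == "lower":
        return sequence.lower()
    if case == "mixed":
        return sequence
    raise ValueError("case must be 'mixed', 'upper' or 'lower'")


if __name__ == "__main__":
    print(reformat_sequence("cag-acg.GAA", case="upper", remove_gaps=True))
```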

The International Union of Pure and Applied Chemistry (IUPAC) nucleic acid code has been adopted to specify a single nucleotide, or a group of alternative nucleotides, by a single letter:

A = adenine     U = uracil                  M = A or C (amino)     D = G or A or T
C = cytosine    R = G or A (purine)         S = G or C             H = A or C or T
G = guanine     Y = T or C (pyrimidine)     W = A or T             V = G or C or A
T = thymine     K = G or T (keto)           B = G or T or C        N = A or G or C or T (any)
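The nucleotide code table translates naturally into a lookup table. The Python sketch below stores each code with the bases it stands for and expands an ambiguous sequence into every concrete sequence it could represent. The names `IUPAC_NUCLEOTIDES`, `expand_ambiguity` and `matches` are illustrative, and the bases within each entry are listed alphabetically rather than in the order printed above.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases each one can represent.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand_ambiguity(sequence):
    """Return every unambiguous sequence that an IUPAC-coded sequence can stand for."""
    choices = [IUPAC_NUCLEOTIDES[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


def matches(code, base):
    """Return True if the IUPAC code covers the given concrete base."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


if __name__ == "__main__":
    print(expand_ambiguity("ARY"))
    print(matches("Y", "t"))
```

The number of expanded sequences grows multiplicatively with each ambiguous position, so this approach is only practical for short motifs such as primers or restriction sites.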

IUPAC amino acid codes:

A = Alanine          G = Glycine       M = Methionine    S = Serine
C = Cysteine         H = Histidine     N = Asparagine    T = Threonine
D = Aspartic Acid    I = Isoleucine    P = Proline       V = Valine
E = Glutamic Acid    K = Lysine        Q = Glutamine     W = Tryptophan
F = Phenylalanine    L = Leucine       R = Arginine      Y = Tyrosine
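In the same spirit, the amino acid codes can be held in a dictionary and used to spell out a protein sequence residue by residue. This is a minimal sketch; the names `AMINO_ACIDS` and `residue_names` are illustrative, and any letter outside the 20 codes listed above is simply reported as "unknown".

```python
# One-letter IUPAC amino acid codes and the residues they denote.
AMINO_ACIDS = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic Acid", "E": "Glutamic Acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine", "I": "Isoleucine",
    "K": "Lysine", "L": "Leucine", "M": "Methionine", "N": "Asparagine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine", "S": "Serine",
    "T": "Threonine", "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
}


def residue_names(protein):
    """Return the full residue name for each one-letter code in a protein sequence."""
    return [AMINO_ACIDS.get(letter.upper(), "unknown") for letter in protein]


if __name__ == "__main__":
    print(residue_names("QTEK"))
```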

5.3.1 Other online sequence conversion tools

  1. FMTSeq: This is a more elaborate version of ReadSeq. It additionally handles data for ClustalW, Zuker, ELEX (I/O files) and so on. URL: http://www.bioinformatics.org/JaMBW/1/2/
  2. EMBOSS: This suite includes several sequence-editing programs, including cutseq, pasteseq, nthseq, extractseq, and so on. URL: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
  3. EMBOSS Seqret: This is another online sequence format conversion tool, offering several output formats. URL: http://www.ebi.ac.uk/Tools/sfc/emboss_seqret/

5.4 QUESTIONS

  1. Identify the sequence formats given below:
    (A)
    >DL;readseq-43434_tmp_1
    readseq-43434_tmp_1 100 bases
    cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagagg
    cgccatcatccggggcatccccggcttctgggccaatgccattgcgaacc*
    (B)
    LOCUS readseq-13129_tmp_1 100 bp
    ORIGIN
    1 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc
    61 cggggcatccccggcttctgggccaatgccattgcgaacc
    //
    (C)
    >readseq-14738_tmp_1 100 bp
    cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc
    cggggcatccccggcttctgggccaatgccattgcgaacc
    (D)
    ID readseq-10695_tmp_1 standard; DNA; UNC; 100 BP.
    SQ Sequence 100 BP;
    cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc 60
    cggggcatccccggcttctgggccaatgccattgcgaacc 100
    (E)
    readseq-946_tmp_1 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc
    readseq-946_tmp_1 cggggcatccccggcttctgggccaatgccattgcgaacc
    (F)
    1 100
    readseq-26 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagagg
    cgccatcatccggggcatccccggcttctgggccaatgccattgcgaacc
  2. Download a nucleotide sequence of your interest from NCBI Nucleotide. Then convert it to the following formats:
    a. Clustal
    b. EMBL
    c. Phylip
  3. Given below is an amino acid sequence (GenBank: BAA36473.1). Show it in upper case in PIR format:
    QTEKLERRRKPHLDRRGAIIRGIPGFWANAIANHPQMSALITDQDE
  4. Suppose you have had a cloned product custom sequenced. How will you open the sequence file, and to which format will you convert it for basic biocomputational analysis (i.e., BLAST, alignment, in silico translation (if applicable), etc.)?
  5. What are the uses of sequence format conversion? A DNA sequence is presented below in some of the commonly used formats. Name each format.

    (A)
    >readseq-26104_tmp_1 204 bp
    ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc
    ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag
    atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca
    aatacccgggctataaatatcgac

    (B)
    LOCUS readseq-11577_tmp_1 204 bp
    ORIGIN
    1 ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc
    61 ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag
    121 atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca
    181 aatacccgggctataaatatcgac

    (C)
    ID readseq-2117_tmp_1 standard; DNA; UNC; 204 BP.
    SQ Sequence 204 BP;
    ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc 60
    ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag 120
    atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca 180
    aatacccgggctataaatatcgac 204
    //

    (D)
    \\\
    ENTRY readseq-18456_tmp_1
    TITLE readseq-18456_tmp_1 204 bases
    SEQUENCE
                5        10        15        20        25        30
      1 c c a t g a a c g c c t t c a t t g t g t g g t c t c g t g
     31 a a c g a a g a c g a a a g g t g g c t c t a g a g a a t c
     61 c c a a a a t g a a a a a c t c a g a c a t c a g c a a g c
     91 a g c t g g g a t a t g a g t g g a a a a g g c t t a c a g
    121 a t g c t g a a a a g c g c c c a t t c t t t g a g g a g g
    151 c a c a g a g a c t a c t a g c c a t a c a c c g a g a c a
    181 a a t a c c c g g g c t a t a a a t a t c g a c
    ///

5.5 BRIEF DESCRIPTION OF SOME OF THE IMPORTANT MOLECULAR SEQUENCE FORMATS

  1. FASTA/Pearson format: This is the simplest and the most common form of representing biological sequences. It was developed by Pearson and Lipman (1988).

    Features:

    • It starts with a “>” sign, followed by a sequence identifier line that gives the name, description, and identity number of the sequence.
    • One-letter symbols represent the residues of the sequence.
    • The sequence is written continuously (without gaps or numbering).
    • An asterisk (“*”) may be placed at the end to mark the end of the sequence (this is optional).
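The FASTA features listed above are simple enough to read and write with a few lines of Python. The sketch below is a minimal illustration, not a full parser: the function names are hypothetical, the optional trailing asterisk is stripped on reading, and the 60-character line width used on writing is an assumption rather than a requirement of the format.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Drop the optional "*" terminator at the end of a sequence line.
            chunks.append(line.rstrip("*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record as a ">" header line followed by wrapped sequence lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    example = ">demo_seq 10 bp\ncagacggaaa*\n"
    for name, seq in parse_fasta(example):
        print(format_fasta(name, seq.upper()))
```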
  2. PHYLIP format: This is the format of the Phylip package for phylogenetic analysis.

    Features:

    • The first line of the input file contains the number of species and the number of characters in each sequence (separated by a space, not a comma).
    • The information for each sequence starts with a ten-character-long species name (any alphabetic characters, with or without spaces and dots).
    • This is followed by the sequence itself.
    • The sequences may be interleaved or sequential.
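A minimal sketch of a sequential PHYLIP writer, based only on the features listed above, is given below. The function name is illustrative; names longer than ten characters are truncated and shorter ones are padded with spaces, and all sequences are required to have the same length. The leading space on the first line is a layout choice of this sketch.

```python
def format_phylip(records):
    """Write (name, sequence) pairs in sequential PHYLIP layout.

    The first line holds the number of sequences and the number of characters;
    each following line holds a 10-character name and then the sequence.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_chars = lengths.pop()
    lines = [f" {len(records)} {n_chars}"]
    for name, seq in records:
        # Truncate or pad the name to exactly ten characters.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_phylip([("human", "acgtacgt"), ("mouse", "acgaacgt")]))
```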
  3. CLUSTAL/.ALN format: This format originated from the CLUSTAL program for sequence alignment. The alignment is written in blocks of 60 characters, and the sequence is written in either UPPER CASE or lower case.

    Features:

    • Every line of a block starts with the sequence name (of any length), followed by at least one space.
    • A hyphen (“-”) denotes a gap (for an InDel) in a multiple or pair-wise sequence alignment.
    • The residue number may be shown at the end of each line (optional).
    • The last line of each block bears asterisks (*) that indicate conserved positions in the sequence alignment (Figure 5.2).
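The block layout and the asterisk line can be sketched as follows. This is an assumption-laden illustration rather than the CLUSTAL program's own output routine: only fully identical, non-gap columns are marked with “*”, the function names are hypothetical, and the header line beginning with the word CLUSTAL that real files carry is left out.

```python
def conservation_line(sequences):
    """Mark with "*" every column where all aligned sequences share the same residue."""
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*sequences)
    )


def format_clustal_blocks(records, width=60):
    """Lay out aligned (name, sequence) pairs in blocks of `width` columns."""
    sequences = [seq for _, seq in records]
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("aligned sequences must have the same length")
    # Names are padded so that every sequence starts in the same column.
    pad = max(len(name) for name, _ in records) + 1
    marks = conservation_line(sequences)
    blocks = []
    for start in range(0, len(sequences[0]), width):
        lines = [name.ljust(pad) + seq[start:start + width] for name, seq in records]
        lines.append(" " * pad + marks[start:start + width])
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(format_clustal_blocks([("seq1", "acgtacgt"), ("seq2", "ac-tacgt")]))
```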
  4. GCG format: The programs in the Genetics Computer Group (GCG) suite use the GCG format of molecular sequences.

    Features:

    • The sequence begins with either of the following lines (mandatorily all uppercase):
      • <for nucleic acid sequences>: !!NA_MULTIPLE_ALIGNMENT 1.0
      • <for amino acid sequences>: !!AA_MULTIPLE_ALIGNMENT 1.0
    • The next line is a description line that holds sequence information.
    • There is one dividing line that shows the number of molecular elements (residues) in the sequence, the date and time of creation of the file, and a checksum (a number computed from the sequence data, which can be recalculated later and compared with the stored value to detect data corruption or loss during storage or transmission).
    • Two dots (..) act as a divider between the descriptive line and the sequence. These dots are not optional.
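The checksum idea can be made concrete with a short sketch. The function below follows the position-weighted scheme commonly described for GCG files (each residue's character code is multiplied by a weight that cycles from 1 to 57, and the total is taken modulo 10000). That scheme is not stated in the text above and should be treated as an assumption; the function name is illustrative.

```python
def gcg_checksum(sequence):
    """Compute a GCG-style checksum: position-weighted sum of character codes mod 10000."""
    total = 0
    for index, residue in enumerate(sequence):
        weight = (index % 57) + 1          # weights run 1..57 and then repeat
        total += weight * ord(residue.upper())
    return total % 10000


if __name__ == "__main__":
    original = "cagacggaaaagctggagcg"
    corrupted = "cagacggaaaagctggagcc"    # last base changed
    # A changed residue gives a different checksum, flagging the corruption.
    print(gcg_checksum(original), gcg_checksum(corrupted))
```

Because the sequence is upper-cased before summing, changing the case of a file does not alter its checksum in this sketch.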
  5. GenBank format: This format is used to display the sequences in the GenBank flat file at NCBI. The format has three parts:
    • The header: contains the locus field (locus name, sequence length, molecule type, GenBank division, modification date), definition, accession, GenInfo identifier (GI), keywords, source, organism, reference, authors, title, and so on.
    • Features: information about genes and gene products, regions of biological significance in the sequence, and sequences that code for proteins and RNA molecules.
    • Sequence: contains the sequence with row numbering. Each row contains 60 entities/residues sub‐divided into six blocks (each of 10 residues).
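The layout of the sequence part (row numbering, 60 residues per row in six blocks of 10) can be reproduced with a short function. This sketch covers only the ORIGIN section and its closing “//” line, not the header or features; the function name is illustrative, and the right-justified nine-column row number is an assumption about the usual flat-file layout.

```python
def format_origin(sequence, per_row=60, block=10):
    """Format a sequence as a GenBank-style ORIGIN section with numbered rows."""
    lines = ["ORIGIN"]
    for start in range(0, len(sequence), per_row):
        row = sequence[start:start + per_row]
        # Split the row into blocks of ten residues separated by spaces.
        blocks = [row[i:i + block] for i in range(0, len(row), block)]
        # The row number is the position of the first residue in that row.
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    lines.append("//")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = "cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatccggggcatcc"
    print(format_origin(demo))
```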
  6. NBRF format: The National Biomedical Research Foundation (NBRF) format is read and written by Multalign Viewer. It is also known as the Protein Information Resource (PIR) format.

    Features:

    • The first line starts with “>”, followed by the sequence code.
    • The second line displays sequence information.
    • From the third line onwards, the sequence is presented in blocks of 10 residues, with five blocks in each row.
    • When multiple aligned sequences are presented in NBRF format, the individual entries are concatenated in one file and brought to equal length by means of leading, trailing, and internal gap characters.
    • Any non-alphanumeric character except the asterisk (*) can be used as a gap character that Multalign Viewer will read.
    • Spaces within the sequence are ignored.
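A minimal NBRF/PIR writer following the features above might look like the sketch below. The function name and parameters are hypothetical. The two-letter type code placed after “>” (for example “P1” for a protein or “DL” for linear DNA) and the terminating asterisk are assumptions drawn from common PIR usage and from the “>DL;” example in Section 5.4, not from the feature list itself.

```python
def format_nbrf(code, description, sequence, seq_type="P1", block=10, blocks_per_row=5):
    """Write one sequence in an NBRF/PIR-style layout.

    Line 1: ">" + type code + ";" + sequence code.
    Line 2: free-text description.
    Line 3 onwards: blocks of ten residues, five blocks per row, ending with "*".
    """
    lines = [f">{seq_type};{code}", description]
    body = sequence + "*"                  # the asterisk marks the end of the sequence
    per_row = block * blocks_per_row
    for start in range(0, len(body), per_row):
        row = body[start:start + per_row]
        lines.append(" ".join(row[i:i + block] for i in range(0, len(row), block)))
    return "\n".join(lines)


if __name__ == "__main__":
    protein = "QTEKLERRRKPHLDRRGAIIRGIPGFWANAIANHPQMSALITDQDE"
    print(format_nbrf("demo_entry", "demo_entry 46 residues", protein))
```

Since spaces within the sequence are ignored by readers of this format, the block spacing is purely for legibility.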
  7. Rich Sequence Format (RSF): These files hold one or more sequences, which may be either related or unrelated to each other. To create a file in RSF, GCG’s NetFetch program can be used to download the flat file from NCBI and save it as *.rsf. RSF files are particularly useful for “Seqlab” (the graphical user interface version of GCG).

    Features:

    • The sequence is presented in a manner similar to EMBL format.
    • The annotations of the sequence are rich, and the important points are:
      • Author: list of authors related to sequence.
      • Sequence weight.
      • Date of creation.
      • Description line of the sequence.
      • Number of leading gaps (called Offset) in the sequence corresponding to an alignment or assembly of fragments.
      • Other sequence features.

FIGURE 5.2 Three sequence formats – namely, FASTA, Phylip and Clustal.
