CHAPTER 5
Sequence Format Conversion

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

5.1 INTRODUCTION

A computer file format is a specific way of encoding data for storage in a file. Biological sequence formats are a family of such file formats, each designed to make sequence files readable by particular programs.

Note: Biological sequences are generally written in a fixed‐width font such as Courier New, so that every line of the text holds the same number of characters and the sequences line up uniformly.

At the lowest level, sequence formats are stored and inter‐converted as ASCII (American Standard Code for Information Interchange) text, in which every character is represented by a numeric code: the characters A–Z are encoded by 65–90 and a–z by 97–122. A sequence format is therefore a prescribed arrangement of characters, symbols, and keywords that specifies the sequence, ID name, comments, and so on.
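The character codes mentioned above can be checked directly. The following is a minimal Python sketch; the function name `ascii_codes` is only an illustrative choice and is not part of any sequence tool.

```python
def ascii_codes(sequence: str) -> list[int]:
    """Return the ASCII code of every character in a sequence string."""
    return [ord(ch) for ch in sequence]


# Upper-case and lower-case bases map to different code ranges.
print(ascii_codes("ACGT"))  # [65, 67, 71, 84]
print(ascii_codes("acgt"))  # [97, 99, 103, 116]

# The boundaries of the two alphabetic ranges quoted in the text.
print(ord("A"), ord("Z"), ord("a"), ord("z"))  # 65 90 97 122
```

Because upper and lower case differ only in their codes, changing the case of a sequence changes the stored bytes but not the biological content.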

The sequence formats are needed for two purposes:

Different programs recognize different formats. We need to convert one format to another before the sequence can be used with a given program.
Presentations of the molecular sequence are sometimes required in a particular format.

Commonly used sequence formats.

1. IG/Stanford	7. Fitch	13. Plain/Raw
2. GenBank/GB	8. Pearson/Fasta	14. PIR/CODATA
3. NBRF	9. Zuker (in‐only)	15. MSF
4. EMBL	10. Olsen (in‐only)	16. ASN.1
5. GCG	11. Phylip3.2	17. PAUP
6. DNAStrider	12. Phylip	18. Pretty (out‐only)

5.2 OBJECTIVE

To convert the format of a given molecular sequence to other sequence formats such as GenBank (NCBI), EMBL, PIR, etc.

5.3 PROCEDURE

The online program ReadSeq (by Don Gilbert) will be used to convert the sequence formats. ReadSeq accepts the following formats: FASTA, Abstract Syntax Notation (ASN.1), National Biomedical Research Foundation (NBRF), EMBL, Fitch (phylogenetic analysis), GenBank, GCG, DNA Strider, IntelliGenetics, Multiple Sequence Format (MSF), Protein Information Resource (PIR), and eight additional specialised formats.

Open the online ReadSeq sequence conversion tool using the URL: http://www‐bimas.cit.nih.gov/molbio/readseq/
Paste a molecular sequence (nucleotide or amino acid sequence) in any supported format into the text box. The software determines the input format automatically (Figure 5.1).
Click on the drop‐down menu, just above the text box (on the left side) and select the desired output format.
There are additional formatting options:
Altering the case of the output sequence: click on one of the radio buttons “MiXeD case”, “UPPER” or “lower” case.
Removal of the gaps: click on the check box to remove existing gaps in the input sequence.
Click on the “Submit” button to get the output.
The “reset” button is there to erase all the input data and start afresh with default settings.
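The steps above can also be imitated offline. The sketch below is not ReadSeq's own code: `guess_format` and `apply_options` are hypothetical helper names, the format guess looks only at the first line of the input, and treating "-" and "." as gap characters is an assumption.

```python
def guess_format(text: str) -> str:
    """Guess a sequence format from the first non-blank line (rough heuristic)."""
    lines = text.strip().splitlines()
    first = lines[0] if lines else ""
    if first.startswith("LOCUS"):
        return "GenBank"
    if first.startswith("ID "):
        return "EMBL"
    # NBRF/PIR headers look like ">P1;code" or ">DL;code".
    if first.startswith(">") and len(first) > 3 and first[3] == ";":
        return "NBRF/PIR"
    if first.startswith(">"):
        return "FASTA"
    if first.upper().startswith("CLUSTAL"):
        return "Clustal"
    return "unknown"


def apply_options(sequence: str, case: str = "mixed", remove_gaps: bool = False) -> str:
    """Apply the case and gap-removal options to a raw sequence string."""
    if remove_gaps:
        # Assumption: '-' and '.' are the gap characters in the input.
        sequence = sequence.replace("-", "").replace(".", "")
    if case == "upper":
        return sequence.upper()
    if case == "lower":
        return sequence.lower()
    return sequence  # "mixed": leave the case untouched


print(guess_format(">DL;seq1\nseq1 100 bases\nacgt*"))  # NBRF/PIR
print(apply_options("acg--T", case="upper", remove_gaps=True))  # ACGT
```

A real converter inspects far more of the file than its first line, so this heuristic should be read only as an illustration of why each format carries a recognisable header.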

**FIGURE 5.1** Homepage of the *ReadSeq* biosequence format conversion tool.

The International Union of Pure and Applied Chemistry (IUPAC) nucleic acid code uses a single letter to specify either one nucleotide or a group of possible nucleotides:

A = adenine	U = uracil	M = A or C (amino)	D = G or A or T
C = cytosine	R = G or A (purine)	S = G or C	H = A or C or T
G = guanine	Y = T or C (pyrimidine)	W = A or T	V = G or C or A
T = thymine	K = G or T (keto)	B = G or T or C	N = A or G or C or T (any)
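The ambiguity codes in the table above can be stored as a lookup table. The following Python sketch, with the illustrative names `IUPAC_NUCLEOTIDES` and `expand`, lists every unambiguous sequence that an ambiguous sequence may stand for.

```python
from itertools import product

# Each IUPAC letter mapped to the bases it may represent (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(sequence: str) -> list[str]:
    """Return all unambiguous sequences matching an IUPAC-coded sequence."""
    choices = (IUPAC_NUCLEOTIDES[base] for base in sequence.upper())
    return ["".join(combo) for combo in product(*choices)]


print(expand("AR"))       # ['AA', 'AG']
print(len(expand("NY")))  # 8 (4 choices for N times 2 choices for Y)
```

The number of expansions grows multiplicatively with each ambiguous position, so this approach is only practical for short sequences such as primers.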

IUPAC amino acid codes:

A = Alanine	G = Glycine	M = Methionine	S = Serine
C = Cysteine	H = Histidine	N = Asparagine	T = Threonine
D = Aspartic Acid	I = Isoleucine	P = Proline	V = Valine
E = Glutamic Acid	K = Lysine	Q = Glutamine	W = Tryptophan
F = Phenylalanine	L = Leucine	R = Arginine	Y = Tyrosine

5.3.1 Other online sequence conversion tools

FMTSeq – This is an extended version of ReadSeq. It also supports data manipulation for ClustalW, Zuker, ELEX (I/O files) and so on. URL: http://www.bioinformatics.org/JaMBW/1/2/
EMBOSS: This suite has several sequence‐handling programs, including cutseq, pasteseq, nthseq, extractseq, and so on. URL: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
EMBOSS Seqret: This is another sequence format conversion tool available online, offering several output formats for conversion. The URL is as follows: http://www.ebi.ac.uk/Tools/sfc/emboss_seqret/

5.4 QUESTIONS

1. Identify the sequence format given below:

A	>DL;readseq‐43434_tmp_1 readseq‐43434_tmp_1 100 bases cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagagg cgccatcatccggggcatccccggcttctgggccaatgccattgcgaacc*
B	LOCUS readseq‐13129_tmp_1 100 bp ORIGIN 1 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc 61 cggggcatccccggcttctgggccaatgccattgcgaacc //
C	>readseq‐14738_tmp_1 100 bp cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc cggggcatccccggcttctgggccaatgccattgcgaacc
D	ID readseq‐10695_tmp_1 standard; DNA; UNC; 100 BP. SQ Sequence 100 BP; cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc 60 cggggcatccccggcttctgggccaatgccattgcgaacc 100
E	readseq‐946_tmp_1 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagaggcgccatcatc readseq‐946_tmp_1 cggggcatccccggcttctgggccaatgccattgcgaacc
F	1 100 readseq‐26 cagacggaaaagctggagcgcaggcgcaagccccacctggaccgcagagg cgccatcatccggggcatccccggcttctgggccaatgccattgcgaacc

2. Download a nucleotide sequence of your interest from NCBI Nucleotide. Then convert it to the following formats:
a. Clustal b. EMBL c. Phylip
3. Given below is an amino acid sequence (GenBank: BAA36473.1) in lower case. Convert it to upper case and show in PIR format: QTEKLERRRKPHLDRRGAIIRGIPGFWANAIANHPQMSALITDQDE
4. Suppose you have custom sequenced a cloned product. How will you open the sequence file, and to which format will you convert it to do basic biocomputational analysis (i.e., using BLAST, Alignment, in silico translation (if applicable), etc.)?
5. What are the uses of sequence format conversion? A DNA sequence is presented below in some of the commonly used formats. Write the name of each format.
(A)
>readseq‐26104_tmp_1 204 bp
ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc
ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag
atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca
aatacccgggctataaatatcgac
(B)
LOCUS readseq‐11577_tmp_1 204 bp
ORIGIN
1 ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc
61 ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag
121 atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca
181 aatacccgggctataaatatcgac
(C)
ID readseq‐2117_tmp_1 standard; DNA; UNC; 204 BP.
SQ Sequence 204 BP;
ccatgaacgccttcattgtgtggtctcgtgaacgaagacgaaaggtggctctagagaatc 60
ccaaaatgaaaaactcagacatcagcaagcagctgggatatgagtggaaaaggcttacag 120
atgctgaaaagcgcccattctttgaggaggcacagagactactagccatacaccgagaca 180
aatacccgggctataaatatcgac 204
//
(D)
\\\
ENTRY readseq‐18456_tmp_1
TITLE readseq‐18456_tmp_1 204 bases
SEQUENCE
5 10 15 20 25 30
1 c c a t g a a c g c c t t c a t t g t g t g g t c t c g t g
31 a a c g a a g a c g a a a g g t g g c t c t a g a g a a t c
61 c c a a a a t g a a a a a c t c a g a c a t c a g c a a g c
91 a g c t g g g a t a t g a g t g g a a a a g g c t t a c a g
121 a t g c t g a a a a g c g c c c a t t c t t t g a g g a g g
151 c a c a g a g a c t a c t a g c c a t a c a c c g a g a c a
181 a a t a c c c g g g c t a t a a a t a t c g a c
///

5.5 BRIEF DESCRIPTION OF SOME OF THE IMPORTANT MOLECULAR SEQUENCE FORMATS

FASTA/Pearson format: This is the simplest and most common way of representing biological sequences. It originated with the FASTA program of Pearson and Lipman (1988).
Features:
- It starts with a “>” sign, followed by a sequence identifier that designates name, description, identity number of the sequence.
- One‐letter symbols represent the sequence entities.
- The sequence is written continuously (without gaps or numbering).
- In the end, the asterisk (i.e., “*”) symbol indicates the end of the sequence (this is optional).
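The FASTA features listed above are simple enough to read and write with a few lines of code. The following is a minimal Python sketch; `parse_fasta` and `to_fasta` are illustrative names, and the 60‐character line width is an assumption, not a requirement of the format.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Drop the optional terminating asterisk.
            chunks.append(line.rstrip("*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Write one record as FASTA text, wrapping the sequence at `width`."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


print(parse_fasta(">s1 demo\nacgt\nacg*\n"))  # [('s1 demo', 'acgtacg')]
print(to_fasta("s1", "acgtac", width=4), end="")
```

Keeping the parsed record as a plain (header, sequence) pair makes it easy to hand the same data to a writer for any other format.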
PHYLIP format: This is the format of the Phylip package for phylogenetic analysis.
Features:
- The first line of the input file contains the number of species and the number of characters in each sequence, separated by a space (no comma).
- The information for each sequence starts with a ten‐character‐long species name (any alphabetic characters, with or without spaces and dots), padded with spaces if the name is shorter.
- This is followed by a string of sequences.
- The sequences may be interleaved or sequential.
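The PHYLIP layout described above can be sketched in Python as follows. This writes the sequential variant only; `to_phylip` is an illustrative name, and truncating long names to ten characters is a simplification that can produce duplicate names.

```python
def to_phylip(records: list[tuple[str, str]]) -> str:
    """Write aligned (name, sequence) pairs in sequential PHYLIP layout."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # First line: number of species and number of characters.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Names occupy exactly ten characters: truncate or pad with spaces.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"


print(to_phylip([("human", "ACGT"), ("chimpanzee_long", "ACGA")]), end="")
#  2 4
# human     ACGT
# chimpanzeeACGA
```

The fixed ten‐character name field is the reason converters often shorten sequence identifiers when producing PHYLIP output.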
CLUSTAL/.ALN format: This format originated from the CLUSTAL program for sequence alignment. The alignment is written in blocks of 60 characters, and the sequence is written in either UPPER CASE or lower case.
Features:
- Every block starts with sequence name (of any length), followed by at least one space.
- “‐” denotes gap (for InDel) in multiple or pair‐wise sequence alignment.
- Residue number is shown at the terminus of each line (optional).
- The last line bearing asterisks (*) at the end of each block indicates the conservation (in sequence alignment) (Figure 5.2).
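The conservation line mentioned above can be derived column by column from the aligned sequences. The following Python sketch marks only fully conserved columns with an asterisk; the other symbols that CLUSTAL programs may print are deliberately left out, and `conservation_line` is an illustrative name.

```python
def conservation_line(aligned: list[str]) -> str:
    """Return '*' under every column where all sequences share one residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        residues = {ch.upper() for ch in column}
        # A column is conserved if it holds a single residue and no gap.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


alignment = ["ACG-T", "ACGAT", "ATGAT"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))  # "* * *"
```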
GCG format: The programs in the Genetics Computer Group (GCG) suite use the GCG format of molecular sequences.
Features:
- The sequence begins with either of the following lines (mandatorily all uppercase)
  - <for nucleic acid sequences>: !!NA_MULTIPLE_ALIGNMENT 1.0
  - <for amino acid sequences>: !!AA_MULTIPLE_ALIGNMENT 1.0
- The next line is a description line that holds sequence information.
- There is one dividing line that shows the number of molecular elements (residues) in the sequence, the date and time of creation of the file, and a checksum (a number computed from the data itself, so that recomputing and comparing it reveals data corruption or loss during storage or transmission).
- Two dots (..) act as a divider between the descriptive line and the sequence. These dots are not optional.
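To illustrate how a checksum detects corruption, the sketch below uses the position‐weighted sum that is commonly described for GCG checksums (weights cycling from 1 to 57, result taken modulo 10000). The exact weighting should be treated as an assumption to verify against the GCG documentation, and `gcg_checksum` is an illustrative name.

```python
def gcg_checksum(sequence: str) -> int:
    """Position-weighted checksum of a sequence, in the range 0-9999."""
    total = 0
    for index, residue in enumerate(sequence):
        # Weights cycle 1, 2, ..., 57, 1, 2, ... along the sequence.
        weight = index % 57 + 1
        total += weight * ord(residue.upper())
    return total % 10000


original = "ACGT"
corrupted = "ACGA"  # last base changed during "transmission"
print(gcg_checksum(original))   # 748
print(gcg_checksum(corrupted))  # 672
```

Because the two values differ, a program that recomputes the checksum after reading the file can tell that the sequence no longer matches the one originally stored.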
GenBank format: This format is used to display the sequences in the GenBank flat file at NCBI. The format has three parts:
- The header: contains the locus field (locus name, sequence length, molecule type, GenBank division, modification date), definition, accession, GenInfo identifier (GI), keywords, source, organism, reference, authors, title, and so on.
- Features: information about genes and gene products, regions of biological significance in the sequence, and sequences that code for proteins and RNA molecules.
- Sequence: contains the sequence with row numbering. Each row contains 60 entities/residues sub‐divided into six blocks (each of 10 residues).
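The layout of the sequence part can be reproduced with simple slicing. The following Python sketch formats only the ORIGIN block and omits the header and features; `genbank_origin` is an illustrative name, and the nine‐character width of the numbering column is an assumption about spacing.

```python
def genbank_origin(sequence: str) -> str:
    """Format a sequence as a GenBank-style ORIGIN block."""
    lines = ["ORIGIN"]
    for start in range(0, len(sequence), 60):
        row = sequence[start:start + 60].lower()
        # Split each 60-residue row into six blocks of 10 residues.
        blocks = [row[i:i + 10] for i in range(0, len(row), 10)]
        # The row number is the position of the first residue in the row.
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    lines.append("//")
    return "\n".join(lines) + "\n"


print(genbank_origin("ACGT" * 17), end="")
```

With a 68‐residue input, the output has one full row numbered 1 and a second, shorter row numbered 61, followed by the terminating "//" line.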
NBRF format: The National Biomedical Research Foundation (NBRF) format is read and written by Multalign Viewer. It is also known as the Protein Information Resource (PIR) format.
Features:
- The first line starts with “>”, followed by the sequence code.
- The second line displays sequence information.
- The third line onwards of the sequence is presented in the form of blocks of 10 entities, with five blocks in each row.
- When multiple sequences are presented in NBRF format, individual sequences are padded to equal length with leading or trailing gap characters.
- Any non‐alphanumeric characters except the asterisk (*) can be used to make the sequence legible to Multalign Viewer.
- Spaces within the sequence are ignored.
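The NBRF/PIR layout described above can be sketched in Python as follows. `to_nbrf` is an illustrative name; the two‐letter type code before the semicolon ("P1" for a complete protein, "DL" for linear DNA) and the terminating asterisk follow common NBRF usage and should be checked against the program that will read the file.

```python
def to_nbrf(code: str, description: str, sequence: str, type_code: str = "P1") -> str:
    """Write one sequence in an NBRF/PIR-style layout."""
    lines = [f">{type_code};{code}", description]
    body = sequence + "*"  # the asterisk marks the end of the sequence
    for start in range(0, len(body), 50):
        row = body[start:start + 50]
        # Five blocks of 10 residues per row, separated by spaces.
        lines.append(" ".join(row[i:i + 10] for i in range(0, len(row), 10)))
    return "\n".join(lines) + "\n"


print(to_nbrf("seq1", "demo protein", "qtekler".upper()), end="")
# >P1;seq1
# demo protein
# QTEKLER*
```

Since spaces within the sequence are ignored by readers of this format, the blocks of ten serve readability only.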
Rich Sequence Format (RSF): These files hold one or more sequences, which may or may not be related to each other. To create an RSF file, GCG’s NetFetch program can be used to download the flat file from NCBI and save it in *.rsf format. RSF files are particularly useful for “Seqlab” (the graphical user interface version of GCG).
Features:
- The sequence is presented in a manner similar to EMBL format.
- The annotations of the sequence are rich, and the important points are:
  - Author: list of authors related to sequence.
  - Sequence weight.
  - Date of creation.
  - Description line of the sequence.
  - Number of leading gaps (called Offset) in the sequence corresponding to an alignment or assembly of fragments.
  - Other sequence features.

**FIGURE 5.2** Three sequence formats – namely, FASTA, Phylip and Clustal.
