Appendix D: EMBL Databases and Tools: An Overview

S Jain1, S Panwar2 and A Kumar3

1 Department of Applied Sciences & Humanities, Jai Parkash Mukand Lal Innovative Engineering and Technology Institute, Haryana, India

2 Department of Genetics and Plant Breeding, Chaudhary Charan Singh University, Uttar Pradesh, India

3 Department of Nutrition Biology, Central University of Haryana, Haryana, India

INTRODUCTION

The European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory (EMBL) and is situated at the Wellcome Trust Genome Campus, Cambridge (UK). It provides molecular data, bioinformatics databases, software and tools free of charge. Its resources cover a broad range of life‐science information and support both basic and advanced research. The descriptions of the databases and tools in this chapter are drawn from the EMBL guide and related sites, so in several instances the wording may be verbatim.

THE EMBL DATABASES

Information on each database was collected from EMBL. The databases available via dbfetch are listed in Table 1, which gives a short description of each database and a link to it.

TABLE 1 Features and links of various EMBL databases.

S.N. Databases Features Links
1. EDAM EMBRACE Data and Methods (EDAM) Ontology. http://edamontology.sourceforge.net/
2. ENA Coding European Nucleotide Archive (ENA) Coding is a database of nucleotide sequences of the CDS (coding sequence) features, as annotated in the ENA Sequence database. ENA Coding records contain the nucleotide sequence of the CDS, along with annotation from the parent nucleotide sequence and additional automatically generated annotation. http://www.ebi.ac.uk/ena/
3. ENA Geospatial A database of nucleotide sequences of the ENA Geospatial Sequence. http://www.ebi.ac.uk/ena/
4. ENA Non‐coding A database of nucleotide sequences of the non‐coding RNA features, as annotated in the ENA Sequence database. ENA Non‐coding records contain the nucleotide sequence of the RNA feature, along with annotation from the parent nucleotide sequence and additional automatically generated annotation. http://www.ebi.ac.uk/ena/
5. ENA Sequence ENA Sequence (formerly known as EMBL‐Bank) is Europe’s primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects, and patent applications. http://www.ebi.ac.uk/ena/
6. ENA Sequence Constructed The ENA Sequence Constructed database division represents complete genomes and other long sequences constructed from segment entries. Instead of containing the sequence, these entries detail how to assemble the sequence from other ENA Sequence entries. http://www.ebi.ac.uk/ena/
7. ENA Sequence Constructed Expanded Expanded entries include the complete nucleotide sequence of the constructed entry. http://www.ebi.ac.uk/ena/
8. ENA/SVA The ENA Sequence Version Archive (SVA) is a repository of all entries which have ever appeared in the EMBL Nucleotide Sequence Databank (EMBL‐Bank) or ENA Sequence databases. http://www.ebi.ac.uk/cgi‐bin/sva/sva.pl
9. Ensembl Gene Ensembl genome databases for vertebrate species and model organisms. For other species, see Ensembl Genomes instead. http://www.ensembl.org/
10. Ensembl Genomes Gene Genome databases for metazoa, plants, fungi, protists and bacteria. http://www.ensemblgenomes.org/
11. Ensembl Genomes Transcript Genome databases for metazoa, plants, fungi, protists and bacteria. http://www.ensemblgenomes.org/
12. Ensembl Transcript Ensembl genome databases for vertebrate species and model organisms. For other species, see Ensembl Genomes instead. http://www.ensembl.org/
13. European Patent Office (EPO) Proteins Protein sequences appearing in patents from the European Patent Office (EPO). http://www.ebi.ac.uk/patentdata/proteins/
14. HGNC HUGO Gene Nomenclature Committee (HGNC) approved gene name and symbol (short‐form abbreviation) for each human gene. http://genenames.org/
15. IMGT/HLA The International ImMunoGeneTics (IMGT) database provides a specialist database for the sequences of the human major histocompatibility complex (HLA), including the official sequences for the WHO Nomenclature Committee For Factors of the HLA System. http://www.ebi.ac.uk/imgt/hla/
16. IMGT/LIGM‐DB A comprehensive database of immunoglobulins and T cell receptors (LIGM) from human and other vertebrates. http://imgt.cines.fr/cgi‐bin/IMGTlect.jv
17. InterPro The InterPro database (Integrated Resource of Protein Domains and Functional Sites) is an integrated documentation resource for protein families, domains, and functional sites. It was originally used to rationalize the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects, but now it also includes the SMART, TIGRFAMs, PIR SuperFamilies and most recently SUPERFAMILY databases. http://www.ebi.ac.uk/interpro/
18. IPD‐KIR A centralized repository for human Killer‐cell Immunoglobulin‐like Receptor (KIR) sequences. http://www.ebi.ac.uk/ipd/kir/
19. IPD‐MHC Sequences of the major histocompatibility complex (MHC) in a number of species. http://www.ebi.ac.uk/ipd/mhc/
20. IPRMC InterPro Matches Complete (IPRMC) for UniProtKB proteins. http://www.ebi.ac.uk/interpro/
21. IPRMC UniParc InterPro Matches Complete (IPRMC) for UniParc proteins. http://www.ebi.ac.uk/interpro/
22. JPO Proteins Protein sequences appearing in patents from the Japanese Patent Office (JPO). http://www.ebi.ac.uk/patentdata/proteins/
23. KIPO Proteins Protein sequences appearing in patents from the Korean Intellectual Property Office (KIPO). http://www.ebi.ac.uk/patentdata/proteins/
24. MEDLINE Comprises citation and abstract records from more than 5000 medically related journals published in the United States and 70 other countries. The files contain over 19 million citations, dating back to the mid‐1940s, and are updated weekly. http://www.nlm.nih.gov/pubs/factsheets/medline.html
25. Patent DNA NRL1 Non‐redundant patent nucleotides level 1 (NRL‐1). Nucleotide sequences from patents clustered by 100% sequence identity over the whole length. http://www.ebi.ac.uk/patentdata/nr/
26. Patent DNA NRL2 Non‐redundant patent nucleotides level 2 (NRL‐2). Nucleotide sequences from patents clustered by patent family, and then by 100% sequence identity over the whole length. http://www.ebi.ac.uk/patentdata/nr/
27. Patent Protein NRL1 Non‐redundant patent proteins level 1. Protein sequences from patents clustered by 100% sequence identity over the whole length. http://www.ebi.ac.uk/patentdata/nr/
28. Patent Protein NRL2 Non‐redundant patent proteins level 2. Protein sequences from patents clustered by patent family and then by 100% sequence identity over the whole length. http://www.ebi.ac.uk/patentdata/nr/
29. Patent Equivalents Patent number equivalents (families) and patent classifications for patents containing sequence data. The patent equivalents are obtained from the patent numbers cited in the major sequence databases (e.g., EMBL‐Bank and Patent Proteins), which are then expanded into a set of patent equivalents forming a WIPO Simple Patent Family. http://www.ebi.ac.uk/patentdata/
30. PDB Comprises structure and sequence information for proteins and nucleic acids. http://www.ebi.ac.uk/pdbe/
31. Reference Sequence project (RefSeq) A curated, non‐redundant collection of reference sequences of naturally occurring molecules, including genomic DNA, transcripts and proteins. http://www.ncbi.nlm.nih.gov/refseq/
32. RefSeq (protein) The protein reference sequences of the RefSeq collection. http://www.ncbi.nlm.nih.gov/refseq/
33. SGT Structural Genomics Targets (SGT) is a protein target registration database, providing information on the experimental progress and status of target amino acid sequences selected for structural determination. http://targetdb.pdb.org/
34. Taxonomy Taxonomic classification of organisms for which there are sequences in the INSDC databases (i.e., DDBJ, EMBL‐Bank, and GenBank) and many other biological databases. http://www.ncbi.nlm.nih.gov/Taxonomy/
35. Trace Archive An archive of capillary electrophoresis trace data. http://www.ebi.ac.uk/ena/
36. UniParc The UniProt Archive (UniParc), a non‐redundant archive of protein sequences. http://www.uniprot.org/
37. UniProtKB The UniProt Knowledgebase (UniProtKB), a curated resource of protein sequence and functional information. http://www.uniprot.org/
38. The UniProt Reference Clusters UniRef100/UniRef90/UniRef50 Clusters that combine closely related sequences into a single record. In UniRef100, UniRef90 and UniRef50, sequences are clustered at 100%, 90% and 50% sequence identity, respectively. http://www.uniprot.org/
39. UniProtKB Sequence/Annotation Version Archive (UniSave) Repository of all versions of UniProtKB/Swiss‐Prot and UniProtKB/TrEMBL entries. http://www.ebi.ac.uk/uniprot/unisave/
40. United States Patent and Trademark Office (USPTO) Proteins Protein sequences appearing in patents from the USPTO. http://www.ebi.ac.uk/patentdata/proteins/
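
Since Table 1 lists the databases reachable through dbfetch, a retrieval request can be pictured as a URL built from a database name and one or more identifiers. The Python fragment below is a minimal sketch, not an official client. The base URL and the db, id, format and style query parameters are assumed from EMBL‐EBI's public dbfetch documentation, while the helper name dbfetch_url, the database name "uniprotkb" and the accession "P12345" are illustrative choices that do not come from this chapter. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed base URL of the dbfetch REST endpoint (not stated in this chapter).
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, ids, fmt="fasta", style="raw"):
    """Build a dbfetch URL for one or more identifiers from a single database.

    `dbfetch_url` is a hypothetical helper; the parameter names db, id,
    format and style are assumptions based on the public dbfetch interface.
    """
    if isinstance(ids, str):
        ids = [ids]
    if not ids:
        raise ValueError("at least one identifier is required")
    # Several identifiers are sent as one comma-separated value;
    # safe="," keeps the commas readable instead of percent-encoding them.
    query = urlencode(
        {"db": db, "id": ",".join(ids), "format": fmt, "style": style},
        safe=",",
    )
    return f"{DBFETCH_BASE}?{query}"


if __name__ == "__main__":
    # Illustrative database name and accession only.
    print(dbfetch_url("uniprotkb", "P12345"))
```

Keeping URL construction separate from the download step makes it easy to inspect the request before sending it with any HTTP client.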

THE EMBL TOOLS

The EMBL tools provide programmatic access to, and analysis of, numerous data resources through Web Services technologies (Li et al., 2015; Lopez et al., 2014). The services are designed for integration and interoperability, and are built on Representational State Transfer (REST), the Simple Object Access Protocol (SOAP) and the Web Services Description Language (WSDL).
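
To illustrate how a REST‐style analysis service is typically driven, the following Python sketch outlines a submit, poll and fetch cycle. It is a sketch under stated assumptions rather than a definitive client: the base URL, the run, status and result endpoint layout, and the status strings are assumed from EMBL‐EBI's public REST documentation, and the names endpoints and wait_for_job are hypothetical. The status lookup is passed in as a function so that the polling logic can be exercised without any network access; SOAP access is not covered.

```python
import time

# Assumed base URL of the EMBL-EBI REST analysis services (not from this chapter).
REST_BASE = "https://www.ebi.ac.uk/Tools/services/rest"

# Status strings assumed to mean "job not finished yet".
IN_PROGRESS = {"PENDING", "QUEUED", "RUNNING"}


def endpoints(tool, job_id=None, result_type=None):
    """Return the assumed REST URLs for a tool and, optionally, one job."""
    base = f"{REST_BASE}/{tool}"
    urls = {"run": f"{base}/run"}  # submit a job here (HTTP POST)
    if job_id is not None:
        urls["status"] = f"{base}/status/{job_id}"
        if result_type is not None:
            urls["result"] = f"{base}/result/{job_id}/{result_type}"
    return urls


def wait_for_job(get_status, poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Poll get_status() until the job leaves the in-progress states.

    get_status is any zero-argument callable returning a status string,
    for example a function that performs an HTTP GET on the status URL.
    """
    for _ in range(max_polls):
        status = get_status()
        if status not in IN_PROGRESS:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling limit")


if __name__ == "__main__":
    # Simulated status sequence; the tool name and job identifier are made up.
    fake = iter(["QUEUED", "RUNNING", "FINISHED"])
    print(endpoints("clustalo", "example-job-id", "out"))
    print(wait_for_job(lambda: next(fake), sleep=lambda seconds: None))
```

Injecting the status function and the sleep function keeps the control flow testable and leaves the choice of HTTP library open.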

The details and description of EMBL services are given in Table 2.

TABLE 2 Description of various EMBL tools.

General Services
Data retrieval and access to various sequence and structural databases.
S.N. Service Description
 1. ArrayExpress Microarray data searching with ArrayExpress.
 2. ChEBI Web Services Entry retrieval from the ChEBI database.
 3. ChEMBL Web Services Data retrieval from the ChEMBL database.
 4. EB‐eye (SOAP)/(REST) EBI search engine (EB‐eye).
 5. ENA Browser Access point for sequence retrieval.
 6. Gene Expression Atlas API Access to statistical data on a curated subset of the ArrayExpress Archive.
 7. MartService Searching and retrieving the data through BioMart.
 8. PDBe (REST) Retrieval of data from PDB and EMDB.
 9. PSICQUIC Retrieval of molecular interaction data from resources including ChEMBL, Reactome, and IntAct.
10. Rhea Access point for manually annotated chemical reactions information.
11. Universal Protein Resource UniProt.org Protein sequence information, including annotation.
12. WSDbfetch (REST)/(SOAP) Identifier entry retrieval system.
Protein Functional Analysis (PFA)
Identifying protein‐related information, i.e., sequences, motifs, conserved regions, etc.
REST/SOAP Service Description
13. FingerPRINTScan Identifies the closest matching PRINTS fingerprints for a protein sequence.
14. InterProScan 5 Combines different protein signature recognition methods into a single resource.
15. HMMER hmmscan Searches a protein sequence against a database of profile hidden Markov models (HMMs).
16. PfamScan Searches a FASTA‐format query sequence against a library of Pfam HMMs.
17. Phobius Prediction of transmembrane topology and signal peptides from the amino acid sequence of a protein.
18. Pratt Identifying conserved patterns in unaligned protein sequences.
19. PROSITE Scan Comparing a protein sequence against the signatures in PROSITE (both patterns and profiles).
20. RADAR Identifies and aligns repeats in protein sequences.
Sequence Similarity Search (SSS)
Provides the identification of homologous sequences.
REST/SOAP Service Description
21. FASTA Fast protein or nucleotide sequence comparison.
22. FASTM Comparison of peptide fragments against a sequence database, using the FASTA suite.
23. NCBI BLAST Nucleotide and protein sequence comparison system.
24. PSI‐BLAST Position Specific Iterative BLAST (PSI‐BLAST), guided mode.
25. PSI‐Search Iterative Smith–Waterman search using a PSI‐BLAST strategy.
Multiple Sequence Alignment (MSA)
Alignment of a set of three or more protein or nucleotide sequences.
REST/SOAP Service Description
26. Clustal Omega Multiple sequence alignment using Clustal Omega.
27. ClustalW2 Global multiple sequence alignment of DNA and protein sequences using ClustalW2.
28. DbClustal Global multiple sequence alignment of DNA or protein sequences using anchor regions from BLAST results.
29. Kalign Sequence alignment using the Kalign method, suitable for large alignments.
30. MAFFT Sequence alignment using the MAFFT method. Fast, and capable of handling large sequences.
31. Multiple Sequence Comparison by Log‐Expectation (MUSCLE) Sequence alignment tool.
32. MView Reformat a multiple sequence alignment or create a multiple sequence alignment from a sequence similarity search result (e.g., BLAST or FASTA).
33. PRANK Sequence alignment using the PRANK method.
34. T‐Coffee Sequence alignment using the T‐Coffee method.
Phylogeny
Phylogenetic analysis
REST/SOAP Service Description
35. ClustalW2 Phylogeny Generates neighbor‐joining or UPGMA phylogenetic trees.
Pairwise Sequence Alignment (PSA)
Alignment of two sequences
REST/SOAP Service Description
36. EMBOSS matcher Waterman–Eggert local alignment using EMBOSS matcher.
37. EMBOSS needle Needleman–Wunsch global alignment using EMBOSS needle.
38. EMBOSS stretcher Myers and Miller global alignment using EMBOSS stretcher.
39. EMBOSS water Smith–Waterman local alignment using EMBOSS water.
40. GeneWise Compares a protein sequence with a genomic DNA sequence.
41. lalign Huang and Miller sim local alignment using lalign.
42. PromoterWise Comparison of two DNA sequences, allowing for inversions and translocations.
43. Wise2DBA The Wise2 DNA Block Aligner (DBA) aligns two DNA sequences.
RNA
RNA Analysis
REST/SOAP Service Description
44. Infernal cmscan Searches sequences against the Rfam database of covariance models (CMs).
45. MapMi Mapping and analysis of miRNA sequences.
Sequence Format Conversion
Convert between sequence formats or validate the formatting of a sequence.
REST/SOAP Service Description
46. EMBOSS seqret Reads and reformats sequence entries.
47. MView Reformatting of multiple sequence alignment data.
48. Readseq Convert biosequences between a selection of common biological sequence formats.
Sequence Statistics
Analyze a sequence to determine its properties and use statistics to assign significance.
REST/SOAP Service Description
49. EMBOSS cpgplot European Molecular Biology Open Software Suite (EMBOSS) cpgplot identifies and plots CpG islands in a nucleotide sequence.
50. EMBOSS isochore Plots isochores in DNA sequences.
51. EMBOSS pepinfo Plots amino acid properties.
52. EMBOSS pepstats Calculates protein properties.
53. EMBOSS pepwindow Generates a hydropathy plot for protein.
54. SAPS Statistical Analysis of Protein Sequences.
Sequence Translation
Translate a coding nucleotide sequence into a protein sequence and vice versa.
REST/SOAP Service Description
55. EMBOSS transeq Translates nucleotide sequences.
56. EMBOSS sixpack Displays DNA sequences with six‐frame translation and ORFs.
57. EMBOSS backtranseq Back‐translates protein sequences.
58. EMBOSS backtranambig Back‐translates protein sequences to ambiguous nucleotide sequences.
Structural Analysis
Analysis of macromolecular structures.
REST/SOAP Service Description
59. DaliLite Pairwise structure comparison.
60. MaxSprout Fast database algorithm for generating protein backbone and side‐chain coordinates from a C‐alpha trace.
Literature and Ontologies
Look up ontology terms and navigate ontology relationships.
Service Description
61. BioModels Access point for mathematical models of biological interest.
62. PICR Protein Identifier Cross‐Reference Service.
63. QuickGO Access to the Gene Ontology (GO) and Gene Ontology Annotation (GOA) databases.
64. Europe PMC Web Service Search access to Europe PubMed Central.
65. WSMIRIAM Web Services for the Minimal Information Requested In the Annotation of biochemical Models (MIRIAM).
66. WSOntology Lookup Search multiple ontologies from a single location.
67. WSSBO Web Services for the Systems Biology Ontology (SBO).
68. WSWhatizit Performs text mining tasks.
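
Several of the pairwise alignment services in Table 2 rest on classical dynamic programming. As an illustration of the Needleman–Wunsch global alignment named for EMBOSS needle, the Python sketch below computes only the optimal global alignment score. It is a simplified teaching example and not the EMBOSS implementation: it assumes a fixed match and mismatch score with a linear gap penalty, whereas EMBOSS needle works with substitution matrices and separate gap opening and extension penalties. The function name and default scores are illustrative.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Simplified scoring (assumed for illustration): identical residues score
    `match`, different residues score `mismatch`, and every gap position
    costs `gap` (linear gap penalty).
    """
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j residues of b; row 0 is b aligned against gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, residue_a in enumerate(a, start=1):
        # Column 0: the first i residues of a aligned against gaps only.
        curr = [i * gap]
        for j, residue_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if residue_a == residue_b else mismatch)
            gap_in_b = prev[j] + gap       # residue of a aligned to a gap
            gap_in_a = curr[j - 1] + gap   # residue of b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows of the scoring matrix are kept in memory, which is enough for the score; recovering the alignment itself would require the full matrix or a traceback structure.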