Appendix C: NCBI Database: A Brief Account

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

The information on each of the databases listed below has been collected from NCBI. In several cases, the description of a database is reproduced verbatim from the corresponding NCBI page. The information has been taken from the NCBI guide and related sites.

1. Assembly: This database maintains and periodically updates organism‐wise information on assembled genomes (WGS) or complete chromosome sequences of prokaryotic and eukaryotic organisms.
2. BioProject: This database holds data and information related to a single project or a consortium. It enables users to obtain the voluminous data belonging to a project in one place. The types of records maintained in BioProject are: genome sequencing and assembly; metagenomes; genetic or RH maps; targeted locus sequencing; epigenetics; phenotype or genotype; variation detection; and transcriptome sequencing and expression.
3. BioSystems: A repository of information (lists, sequences, structures) on biological molecules (genes, proteins, small molecules) and the pathways involved in biological systems. It includes data from BioCyc (including its Tier 1 EcoCyc and MetaCyc databases and its Tier 2 databases), KEGG, Reactome, the National Cancer Institute’s Pathway Interaction Database, WikiPathways and Gene Ontology (GO).
4. Bookshelf: A database of freely accessible electronic books and documents in life science and healthcare. It is integrated with NCBI resources such as PubMed, Gene, OMIM and PubChem.
5. ClinVar: ClinVar is a public repository of sequence variations and of information about their relationship to human health. It maintains records on medical conditions caused by genetic aberrations, collected from a number of distinct sources, including SNOMED CT, MeSH and OMIM.
6. Clone DB: A public database that maintains information (sequence data, map positions, and distributor information) on clones associated with genomic, cDNA and cell‐based libraries of different eukaryotic organisms.
7. BioSample: A central repository of descriptions of the biological source materials (including tissues, cell lines, experimental organisms) used in different assays.
8. Computational resources from NCBI’s Structure Group: Provides access and links to the resources (databases and tools) developed by the Structure Group of NCBI for working with macromolecular structures and identifying conserved domains. The resource also offers tools for classifying proteins, determining the biological activity of small molecules, analyzing pathways, etc.
9. Consensus CDS (CCDS): A collaborative effort among NCBI, EBI, the University of California at Santa Cruz (UCSC) and the Wellcome Trust Sanger Institute (WTSI) to identify and annotate a core set of protein‐coding regions.
10. Conserved Domains Database (CDD): CDD, a protein annotation resource, holds well‐annotated multiple sequence alignment models of ancient domains as well as of full‐length proteins.
11. Database of Expressed Sequence Tags (dbEST): This EST database contains short, single‐read transcript sequences from GenBank.
12. Database of Genome Survey Sequences (dbGSS): This NCBI database contains comprehensively annotated short, single‐pass reads of genomic sequences (which could be cDNA or non‐coding DNA) from sources such as random survey sequences, clone‐end sequences, artificial chromosomes (BAC/YAC) or cosmids, and exon‐ and gene‐trapped sequences.
13. Database of Genomic Structural Variation (dbVar): Maintains information on large‐scale genomic variation, namely sizeable indels, translocations and inversions, together with the association of these variations with phenotypes.
14. Database of Genotypes and Phenotypes (dbGaP): This database archives and distributes the results of studies on the interaction of genotype and phenotype. The information pertains to molecular diagnostics, genome‐wide association studies (GWAS), and the association of genotype with non‐clinical traits. dbGaP also offers cloud computing services.
15. Database of Major Histocompatibility Complex (dbMHC): Holds information on genes and related clinical data associated with the human Major Histocompatibility Complex (MHC). The tool dbMHCms searches for descriptions of reported short tandem repeats (STRs) belonging to the MHC. The database has a “Reagent Database” section (reagent data needed for DNA typing) and a “Clinical” section (clinical data from anonymous individuals who shared their data with the project).
16. Database of Short Genetic Variations (dbSNP): A public database of genetic variation within and across different species. SNP data from a wide range of experiments, from physical mapping and association studies to pharmacogenomics and evolutionary studies, can be submitted to dbSNP.
17. Epigenomics: This database holds epigenomic data on biological samples, and also serves as a tool (a genome browser) for selecting, downloading and viewing multiple sets of epigenomic data.
18. GenBank: A public repository of annotated DNA sequences. The International Nucleotide Sequence Database Collaboration (INSDC) coordinates the exchange of DNA data among NCBI, EMBL and DDBJ. The FTP release is updated every two months.
19. Gene: This database integrates information on nomenclature, variations, reference sequences (RefSeqs), gene maps, molecular pathways and phenotypes. This information is linked to genome‐, phenotype‐, and locus‐specific resources across a wide range of species. “Gene” can be searched by querying on any word, by restricting the query term to a certain field, or by applying filters or properties.
20. Gene Expression Omnibus (GEO) Database: A public repository of experimental data generated by microarray experiments and by high‐throughput genomic methods such as next generation sequencing (NGS).
21. Gene Expression Omnibus (GEO) DataSets: Stores curated gene expression DataSets, as well as the original Series, Sample and Platform records, from the Gene Expression Omnibus (GEO) repository. Differential expression patterns are collated and displayed along with clustered heatmaps for easy comprehension.
22. Gene Expression Omnibus (GEO) Profiles: Maintains the curated gene expression profiles belonging to the Gene Expression Omnibus (GEO) archive.
23. GeneReviews: This database, part of the GeneTests website, archives peer‐reviewed descriptions (diagnosis, counseling, etc.) of inherited diseases.
24. GeneTests: A knowledge base on the diagnosis and management of inherited diseases and on genetic testing.
25. Genes and Disease: This database contains articles on genetic diseases and their causative genes.
26. Genetic Testing Registry (GTR): A repository of information on genetic tests for inherited diseases, including their purpose, methodology, validity, utility, challenges, etc. The information is submitted voluntarily by the test providers.
27. Genome: This database archives the sequences and related map data from the whole genomes of different organisms (bacteria, archaea, and eukaryota), including both completely sequenced genomes and those still in progress.
28. Genome Reference Consortium (GRC): An international consortium of research institutes working on genome mapping, association studies, genome informatics, etc., with the aim of improving the human and mouse genome reference assemblies.
29. HIV‐1, Human Protein Interaction Database: This database provides links to PubMed records on interactions between HIV‐1 proteins and human proteins, along with the relevant sequences.
30. HomoloGene: A tool for identifying possible orthologs by comparing homologous nucleotide sequences from different species.
31. Influenza Virus: Holds data from the National Institute of Allergy and Infectious Diseases (NIAID) Influenza Genome Sequencing Project and from GenBank, and maintains the NCBI Influenza Virus Sequence Database. The resource is also used to analyze flu sequences, which are annotated and then submitted to GenBank.
32. Journals in NCBI Databases: A subset of the NLM Catalog database that maintains information on journals cataloged in PubMed and other NCBI database records.
33. Medical Subject Headings (MeSH) Database: A comprehensive catalog of the medical vocabulary used for indexing journal papers and books in the life sciences. The database is used to look up MeSH terms, obtain their definitions and related information, and build search strategies for PubMed.
34. NCBI C++ Toolkit Manual: A public‐domain collection of (mostly) system‐independent libraries, together with a development framework, demos, release notes, etc.
35. NCBI Glossary: Contains definitions and descriptions of the tools available at NCBI, and explanations of bioinformatics terms and acronyms.
36. NCBI Handbook: Includes exhaustive explanatory notes on NCBI databases and software, which can be accessed through NCBI Bookshelf.
37. NCBI Help Manual: A collection of downloadable Help documents on tools and resources such as BLAST, Entrez (search engine), GenBank (databank), PubMed and NLM.
38. NCBI Website Search: A search tool provided by NCBI for finding documents, newsletters, sample code and other resources at NCBI.
39. National Library of Medicine (NLM) Catalog: An electronic library catalog that enables searching of the bibliographic data for around 1.5 million journals, books, software, audiovisual documents, etc. held at the National Library of Medicine, the largest online library of medical science.
40. Nucleotide Database: Maintains a vast repository of nucleotide sequences (gene/transcript/genome data) obtained from sources such as GenBank, RefSeq, TPA and PDB.
41. Online Mendelian Inheritance in Animals (OMIA): Catalogs textual information and references on inherited disorders and the associated genes in about 200 animal species; humans and mice are not covered. The genetic disorders are linked to genes and to the relevant literature (PubMed).
42. Online Mendelian Inheritance in Man (OMIM): This database was developed to supply comprehensive information and references on Mendelian disorders in humans. The related genes and the relationship between genotype and disease phenotype are also detailed. Each entry is linked to multiple genetic databases (gene and protein sequences), literature, genetic tests, mutation databases, etc.
43. PopSet: A repository of DNA sequences obtained from the members of a population (individuals of the same species or of different species) in order to study their evolutionary relationships. DNA sequences can be submitted to PopSet via the NCBI Sequin tool.
44. Probe: A public database of detailed information on the reagents used in nucleic acid experiments (RNAi, microarray, genotyping, gene expression, etc.) across a wide range of biomedical research. It helps researchers worldwide to access information about useful biochemicals, molecular probes, distributors, etc.
45. Protein Clusters: The protclustdb (protein cluster database) maintains clusters of RefSeq proteins from a variety of sources, including prokaryotic genomes and plasmids, viruses, organelles, protozoa, and plants. The database contains both uncurated and manually curated clusters, and is updated every three months. Cross‐references to related external resources (NCBI‐COG, KEGG, InterPro, etc.) are provided for proteins and protein clusters.
46. Protein Database: Maintains amino acid sequences translated in silico from annotated coding sequences in NCBI RefSeq, GenBank, etc., along with records from external protein sequence sources, including SwissProt, PDB and PIR. The GenPept record provides cross‐references to the CDS (if applicable), PubMed, etc.
47. PubChem BioAssay: PubChem BioAssay is one of the three components of NCBI PubChem (a search tool to determine chemical similarity). It links to PubChem compounds and describes their bioactivity, including descriptions of the bioassays, screening conditions, etc.
48. PubChem Compound: This database holds the validated chemical structures derived from the records of PubChem Substance. Compounds are pre‐clustered by similarity and linked to related databases and information (structure information, references).
49. PubChem Substance: Describes the substance records of PubChem (structures, cross‐references, etc.) and provides links to biological screening results.
50. PubMed: One of the most popular databases of NCBI‐NLM. It covers biomedical books as well as literature from a wide range of fields (including bioengineering and chemical sciences), drawn from different sources, including biological journals and MEDLINE. Each record is assigned a unique PubMed identifier (PMID).
51. PubMed Central (PMC): An archive of freely available biomedical literature.
52. PubMed Health: An archive of clinical reviews that gives clinicians and end users access to research addressing biomedical and clinical issues.
53. RefSeqGene: A subset of the RefSeq database in which the reference genomic sequences of human genes are maintained. Curation draws on locus‐specific data as well as on information from the genetic testing community.
54. Reference Sequence (RefSeq): This curated, non‐redundant database maintains naturally occurring nucleotide (DNA, RNA) and protein sequences from a large number of species, with linked records spanning genomes, transcripts and translation products.
55. Retrovirus Resources: A public resource for research on retroviruses that provides online tools (a genotyping tool based on the BLAST algorithm, an alignment tool for global alignment, annotated maps, etc.).
56. SARS‐CoV: Maintains data (sequences and genome sequence alignments of various isolates) and information (publications) on the SARS coronavirus.
57. Sequence Read Archive (SRA): This database archives short reads (<1000 bases) produced by high‐throughput, massively parallel sequencing platforms, including Roche, Illumina, ABI SOLiD System, etc.
58. Structure (Molecular Modeling Database): Macromolecular structures (from PDB) and visualization tools are available here. The Molecular Modeling DataBase (MMDB) or Entrez Structure DataBase (ESDB) stores the experimentally determined 3D structures of biomolecules.
59. Taxonomy: Holds the standard nomenclature and scientific classification of taxa of prokaryotic and eukaryotic origin. The species names are manually compiled for each organism linked to the entries of the INSDC (International Nucleotide Sequence Database Collaboration: GenBank + EMBL + DDBJ).
60. Third Party Annotation Database: The TPA database maintains and provides experimental (peer‐annotated from wet‐lab evidence) or inferential (not from direct wet‐lab experimentation) results. It derives each TPA sequence from sequence data already available in GenBank, and also annotates the sequences.
61. Trace Archives: This public repository has three sections:
Sequence Read Archive: stores NGS data from a variety of NGS platforms;
Trace Archive: stores sequencing data from gel or capillary sequencers;
Trace Assembly Archive: holds assemblies of sequencing reads built by pairwise or multiple sequence alignment.
62. UniGene: A repository of transcript sequences from expressed genes or pseudogenes. Each entry groups all the transcripts from the same locus, and provides information about gene expression, genomic location, complementary DNA, and protein similarity.
63. UniGene Library Browser: A database that enables users to browse expressed sequence tags by organism, tissue type, and stage of biological development.
64. UniSTS: A comprehensive database that archives experimentally derived sequence tagged sites (STSs).
65. Viral Genomes: Curated virus genome sequences are maintained in this database.
66. Virus Variation: An organized collection of viral genome sequences that aims to make the search, retrieval, display, and analysis of virus genomes easy. It provides pipelines for the analysis of viral genomes to assist discovery using the available sequence data.
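
The Gene and MeSH entries above mention restricting a query term to a certain field and building search strategies for PubMed. As an illustration only, the sketch below composes such field‐restricted queries as URLs for NCBI's Entrez E‐utilities ESearch service. It is not taken from the NCBI guide: the function name build_esearch_url, the example search terms, and the choice of the field tags [Gene Name], [Organism] and [MeSH Terms] are assumptions made here for demonstration. The code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the Entrez E-utilities (the ESearch endpoint is used below).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, terms, retmax=20):
    """Compose an ESearch URL for an Entrez database (hypothetical helper).

    db     -- Entrez database name, e.g. "gene" or "pubmed"
    terms  -- list of (text, field) pairs; field=None searches all fields
    retmax -- maximum number of record IDs to request
    """
    parts = []
    for text, field in terms:
        # A field tag in square brackets restricts the term to that field.
        parts.append(f"{text}[{field}]" if field else text)
    query = " AND ".join(parts)
    params = {"db": db, "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example 1: a Gene query restricted by gene name and organism.
gene_url = build_esearch_url(
    "gene", [("BRCA1", "Gene Name"), ("Homo sapiens", "Organism")]
)

# Example 2: a PubMed query restricted to a MeSH heading.
pubmed_url = build_esearch_url("pubmed", [("Mastitis, Bovine", "MeSH Terms")])

print(gene_url)
print(pubmed_url)
```

Opening either printed URL in a browser (or fetching it with any HTTP client) should return the matching record identifiers; readers who query NCBI programmatically are advised to consult the current E‐utilities usage guidelines first.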
