3
Bioinformatics and Its Applications in Genomics

David J. Parry‐Smith

Wellcome Sanger Institute, Hinxton, UK

3.1 Significance and Short Background

This chapter introduces the field of bioinformatics, which is a scientific discipline dealing with the analysis of biological data. More specifically, we deal with bioinformatics as it is applied to the field of genomics.

Biological data can take many forms, including DNA, RNA and protein sequence information. It can encompass higher level collections of data and analyses of data. These include databases of structurally and functionally relevant sequence patterns and databases of small molecule ligand binding sites. It can also encompass imaging of a wide variety of processes, including X‐ray diffraction data and images of the three‐dimensional structures of DNA, RNA and protein complexes. New forms of biological data are being generated all the time as new experimental approaches are developed. Analysis of the data derived using these techniques is underpinned by a sound understanding of how bioinformatics relates to the functioning of the cellular machinery, whole organisms (e.g. genetics) and even the evolution of species (e.g. phylogenetics).

Data analysis has always been fundamental to scientific understanding. Observation leads to classification and generalisation. Rules emerge that enable us to explain the way systems behave now and to predict how they may behave in the future. Such systems range from tracking the course of the planets and stars across the sky to the quantum behaviour of fundamental particles in an atom of helium, say. The development of technology has a major impact on the amount of data available to review and analyse.

In the field of bioinformatics, data gathered on the primary sequence of proteins and the order of the bases comprising the sequences of the DNA of genes have resulted in substantial repositories that are freely available to the public for exploration. The teams of scientists that isolated and cloned the individual genes whose sequences were deposited in these data resources published their work in the scientific literature, at the same time depositing their data in a sequence database. Scientists interested in determining the three‐dimensional structure of the protein that the gene expresses also deposited their data upon publication, but in databases more appropriately designed to hold such data. Bioinformatics is fundamentally involved in providing the means for collating (gathering), analysing and curating (maintaining or looking after) the databases of structural coordinates and DNA or protein sequence. These resources are referred to as primary resources because they contain the actual data determined by the experimental science. Secondary resources are databases that corral information gleaned from analysis of the primary resources – such as a database of gene families or a database of conserved patterns of residues in protein families (sometimes referred to as sequence motifs defining functional or structural domains).

In the late twentieth century, a more holistic approach was taken to the production of data that contributes to the primary resources. Instead of focusing on individual genes, whole genomes of organisms important to medical research were sequenced. This resulted in a very rapid expansion of the sequence data available in sequence databases of all types. Annotation of the databases became a key function of bioinformatics. The major centres for warehousing sequence data (EMBL‐EBI in Europe, GenBank in the USA) have substantial ongoing programmes of work in annotation. Whereas, formerly, only the sequences of the cloned genes tended to be available, now sequence information related to those parts of the genome involved in the control of gene expression and functional but non‐coding regions is readily accessible. This is because whole genomes are being sequenced. In addition, other parts of the genome (previously referred to as ‘junk DNA’, but in fact whose function has simply yet to be determined) are now available for analysis.

It took a long time to sequence the first whole genomes compared to the current rate of sequencing. The human genome sequence (itself a mosaic of multiple individuals and not representative of any single person) took many years to complete, being officially begun in 1990 and declared complete in 2003, with initial publication in 2001 [1]. Now, in the second decade of the twenty‐first century, a whole genome sequence can be obtained for a specific cell line in a few days. The bioinformatics required to deal with the data from such experiments has had to be developed to cope with the amount of data involved. It is interesting to note that although computing capacity has increased to a staggering extent over the last few years, sequencing capacity has far exceeded this increase. In fact, today we do not normally think of library construction and sequencing for whole genome sequencing (WGS) to be experimental, as systems are available that enable a factory approach to WGS. Projects to generate WGS for the UK Biobank of 50 000 genomes (www.ukbiobank.ac.uk), 100 000 genomes (www.genomicsengland.co.uk) and 2 million genomes (AstraZeneca in partnership with other providers) are in progress [2]. Other commercial approaches (e.g. Human Longevity, Inc.) are amassing substantial proprietary databases of genome sequence data.

3.1.1 Big Data

The term big data is applied to large volumes of biological data, as it is to data derived from financial or astronomical domains. Each area generates petabytes of data on a regular basis. As in all other areas of big data endeavour, computational techniques suited to data streams of this size must be applied. These include machine learning and artificial intelligence techniques such as linear and non‐linear regression and classification.

3.1.2 Computational Challenges

There are significant challenges in analysing sequence data for genomics. These include:

  • Computer hardware utilisation, for example, exploitation of graphics chips (GPUs) designed for rapid matrix calculations (as demanded by machine learning)
  • Data storage, covering parallelised file systems, object stores and hierarchical storage
  • Software development, requiring expertise in writing code that makes effective and efficient use of hardware as data volumes continue to grow.

Cloud‐based approaches are coming to the fore, with commercial organisations offering viable solutions and research institutions implementing internal cloud‐based flexible compute environments. This approach enables a timely response to the rapidly changing flow of data and the information and knowledge derived from it. In terms of software advances, programming for machine learning and artificial intelligence has much to contribute in identifying patterns and trends when analysing large quantities of genomic data. Comparison of whole genome sequences and of the variant call format (VCF) files resulting from these large sequencing projects is an area of active research [3].

3.1.3 Bioinformatics Roles

Bioinformatics scientists who assist in the design of experiments for research programmes both large and small rely on the work of the primary resource annotators. These annotators are proficient at reviewing the literature related to the part of the genome they are annotating, and running and interpreting the results of sequence alignment software. A second group of bioinformaticians, who are skilled at using the many tools of bioinformatics, assist in the selection and generation of oligonucleotides (e.g. polymerase chain reaction [PCR] or sequencing primers) used in genetic engineering designs. These designs are key to the execution of experimental work in the laboratory. They use the annotation of genomic sequences by focusing on individual genes to support experimental design.

As the scope of the research programmes becomes more ambitious, higher throughput systems are put in place, including laboratory automation and high throughput techniques based on multiwell plates (e.g. 96‐well and 384‐well plates). The amount of material generated by these high throughput processes demands ever higher capacity in sequencing capability to confirm what has happened in these experiments (known as genotyping). Thus there are additional bioinformatics roles involved in putting together pipelines of bioinformatics tools (aligners, quantifiers, visualisations, further database storage of results, laboratory information management systems [LIMSs], reporting systems and so on).

Additional roles encompass the analysis involved in the related areas of transcriptomics (the study of the complete transcript set from cells) and epigenomics (the study of all the epigenetic modifications to the cell's genetic material, including methylation of DNA and histone modification). These roles may involve assessing which genes in a whole genome are affected when a cell is subjected to some stress, like a drug or heat or light. Again, annotation is key to accurately map the transcripts back to the correct gene or non‐coding RNA. Epigenomics and transcriptomics data also drive a more accurate annotation of the genome.

Therefore, some bioinformatics scientists will be intimately involved in experimental design and, indeed, many laboratory‐based scientists will have significant bioinformatics skills in their area of expertise. Even now, most experiments contain at least some bioinformatics analysis. The sheer bulk of data means that it is often computational analysis that leads the experimental hypotheses, or at least allows for more rapid removal of hypotheses that are wrong. Other bioinformatics scientists will be much farther removed from the molecular biology in the laboratory. Some will need advanced computational and software development skills; others may be able to rely on publicly available or commercial bioinformatics tools to support them in their experimental design or analytical work.

Different business needs will also define the role of the bioinformatics scientist. In a pharmaceutical preclinical research team, bioinformatics may be used to understand the evolutionary relationships between members of a family of genes. This helps determine whether a specific gene is being targeted by a chemical compound (e.g. a drug) or whether the compound may affect an unintended but closely related member of the protein family. Bioinformatics may be used in the following ways (among others):

  • Curate in‐house databases of drug targets (generally proteins, which are normally enzymes catalysing key reactions).
  • Select antigenic sites for potential antibody‐based therapies.
  • Design gene editing protocols in a precision genome engineering context to support a drug discovery programme.

Some research contexts may require extensive knowledge of structural biology and protein–DNA complexes as well as complexes involving RNA, other proteins or small molecules. This area pushes more into the disciplines of biophysics and computational chemistry, which are both close scientific neighbours of bioinformatics. Images and the metadata that are assigned to them by technicians and clinicians are important in the context of clinical bioinformatics.

Whatever the context of the bioinformatics, there are some fundamental theoretical concepts that should be understood by all scientists as being characteristic of bioinformatics as a science.

3.2 Theory/Principles

3.2.1 Molecular Biology

Whatever the type of bioinformatics role and whatever the scale of data to be analysed, all must understand the fundamental molecular biology that underpins the biology and the generation of biological data. Core to this understanding is the central dogma of molecular biology: ‘DNA makes RNA makes protein’ (a useful, if oversimplistic, memory phrase), but more formally ‘DNA is transcribed to produce RNA which is translated to produce protein’. Fundamentally, the central dogma is about the flow of information from DNA, via RNA to the protein product (as stated by Francis Crick in 1956).

From the point of view of bioinformatics, there is a one‐way flow of information from DNA to DNA, from DNA to RNA and from RNA to protein. There is no flow of information from protein back to RNA or DNA. Proteins can and do interact with DNA and RNA but there is no information transfer. (Reverse transcription – the copying of RNA back into DNA by retroviruses and by the laboratory enzyme reverse transcriptase – is a recognised special case of nucleic acid information flow, but information never flows from protein back to nucleic acid.) The information we are describing here is the sequence of DNA, segments of which are transcribed to produce an RNA template. This is then translated three bases (a codon) at a time into the amino acids that form the single polypeptide chain of a protein. Knowing the sequence of bases in a stretch of DNA (or RNA) allows us to predict the resulting protein sequence. Knowing a protein sequence does not allow us to predict with any certainty the sequence of bases that generated that protein. This is because the genetic code is a redundant code with multiple codons for the same amino acid residue. There are only two exceptions: there is only one codon for methionine (M), which has a special role as the first residue in a nascent polypeptide sequence in RNA translation, and only one codon for tryptophan (W). These two amino acids are generally among the least commonly observed residues, although, in individual protein sequences, biophysical requirements may mean that other residues have a lower relative abundance. In fact, there is a large degree of variability in the total abundance of amino acids in a protein. Cysteine is a specific example of an amino acid that is more commonly observed in the shorter sequences of the polypeptide hormones. Here, the strong covalent bonding of the disulphide bridge is necessary to stabilise the structure of the smaller molecule in an extracellular environment. This is one of the major reasons for the existence of disulphides. In genomic bioinformatics, we are always mindful that the workhorses of the cell are proteins.
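The redundancy of the genetic code is easy to make concrete in code. The sketch below builds the standard codon table (standard nuclear code only, ignoring the selenocysteine and pyrrolysine recoding special cases) and translates a DNA coding sequence codon by codon:

```python
# Standard genetic code, grouped by amino acid (one-letter codes; '*' = stop).
CODONS = {
    'F': ['TTT', 'TTC'], 'L': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'],
    'I': ['ATT', 'ATC', 'ATA'], 'M': ['ATG'],
    'V': ['GTT', 'GTC', 'GTA', 'GTG'],
    'S': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'],
    'P': ['CCT', 'CCC', 'CCA', 'CCG'], 'T': ['ACT', 'ACC', 'ACA', 'ACG'],
    'A': ['GCT', 'GCC', 'GCA', 'GCG'], 'Y': ['TAT', 'TAC'],
    'H': ['CAT', 'CAC'], 'Q': ['CAA', 'CAG'], 'N': ['AAT', 'AAC'],
    'K': ['AAA', 'AAG'], 'D': ['GAT', 'GAC'], 'E': ['GAA', 'GAG'],
    'C': ['TGT', 'TGC'], 'W': ['TGG'],
    'R': ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
    'G': ['GGT', 'GGC', 'GGA', 'GGG'], '*': ['TAA', 'TAG', 'TGA'],
}
# Invert to a codon -> amino acid lookup (all 64 codons).
CODON_TO_AA = {c: aa for aa, cs in CODONS.items() for c in cs}

def translate(dna: str) -> str:
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TO_AA[dna[i:i + 3].upper()]
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)

print(translate('ATGTGGTAA'))              # Met-Trp-stop -> 'MW'
print(len(CODONS['M']), len(CODONS['W']))  # the two single-codon residues: 1 1
```

Because the table is many‐to‐one, translate() has no faithful inverse: back‐translating ‘MW’ happens to be unambiguous (ATGTGG), but a leucine or arginine could have come from any of six codons.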

You should use reliable resources from the reference list at the end of this chapter to review the following topics:

  • The structure of DNA
  • The four DNA bases (A, T, C, G)
  • Complementary base‐pairing in DNA
  • The four bases of RNA (A, U, C, G) and their base‐pairing
  • The production of messenger RNA (mRNA) from the DNA template in the genome
  • The production of the primary protein sequence (consisting of the 20 naturally occurring amino acids: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
  • The physical characteristics of the amino acid residues (see Table 3.1).

Table 3.1 The one‐letter, three‐letter and full names of the amino acid residues.

One‐letter code | Three‐letter code | Name | No. of codons | 1st base | 2nd base | 3rd base | Properties
F | Phe | Phenylalanine | 2 | T | T | T, C | Aromatic, hydrophobic
L | Leu | Leucine | 6 | T | T | A, G | Aliphatic, hydrophobic
L | Leu | Leucine |   | C | T | T, C, A, G | Aliphatic, hydrophobic
I | Ile | Isoleucine | 3 | A | T | T, C, A | Aliphatic, hydrophobic
M | Met | Methionine | 1 | A | T | G | Hydrophobic
V | Val | Valine | 4 | G | T | T, C, A, G | Aliphatic, hydrophobic
S | Ser | Serine | 4 | T | C | T, C, A, G | Ambivalent
P | Pro | Proline | 4 | C | C | T, C, A, G | Hydrophobic, small
T | Thr | Threonine | 4 | A | C | T, C, A, G | Ambivalent, tiny
A | Ala | Alanine | 4 | G | C | T, C, A, G | Ambivalent, tiny
Y | Tyr | Tyrosine | 2 | T | A | T, C | Ambivalent, polar
H | His | Histidine | 2 | C | A | T, C | Basic, hydrophilic
Q | Gln | Glutamine | 2 | C | A | A, G | Ambivalent, polar
N | Asn | Asparagine | 2 | A | A | T, C | Ambivalent, polar
K | Lys | Lysine | 2 | A | A | A, G | Basic, hydrophilic
D | Asp | Aspartate | 2 | G | A | T, C | Acidic, hydrophilic
E | Glu | Glutamate | 2 | G | A | A, G | Acidic, hydrophilic
C | Cys | Cysteine | 2 | T | G | T, C | Ambivalent, small
W | Trp | Tryptophan | 1 | T | G | G | Hydrophobic, large
R | Arg | Arginine | 6 | C | G | T, C, A, G | Basic, hydrophilic
R | Arg | Arginine |   | A | G | A, G | Basic, hydrophilic
S | Ser | Serine | 2 | A | G | T, C | Ambivalent, tiny
G | Gly | Glycine | 4 | G | G | T, C, A, G | Ambivalent, smallest
* | – | Stop | 3 | T | A | A, G | (See legend)
* | – | Stop |   | T | G | A | (See legend)

The codons for the residues can be deduced by scanning across the row for the residue, writing down the first and second bases shown, and then adding each base from the third base column to that pair. Therefore, for arginine, if the first base is A, the second is G, followed by either A or G (AGA, AGG). Note that it is the second base that is most critical in determining the properties of the amino acid. The third base can vary considerably, while often still resulting in the same residue. In certain contexts the stop codons are recoded: TGA can specify Sec (selenocysteine) when a dedicated mRNA element and accessory machinery are present, and TAG can likewise specify pyrrolysine in some organisms. Mutations that introduce a premature stop codon cause shortened forms of polypeptides to be synthesised and are known as nonsense mutations. The properties column is derived from the classification in [4]. * is the one‐letter code used for the stop codons.

Note that, generally, in bioinformatics the one‐letter codes for the amino acid residues are used. See Table 3.1 for the one‐letter codes, three‐letter codes and full names, along with their codons and physical characteristics.

Fundamental to the skill of bioinformatics is the ability to handle sequence data either for an individual gene or part of a gene and its protein product. Therefore, we need to be able to convert information at DNA or RNA levels to protein sequence efficiently. We can then work on multiple genes, or indeed whole genomes, effectively. This process is known as three‐frame translation (or six‐frame if both strand directions are taken into account). The longest open reading frame (ORF) – that is, the longest sequence of residues translated without encountering a stop codon – is generally taken to be the standard protein product of the nucleic acid sequence. See Figure 3.1 for an illustration of the translation tracks turned on in the Ensembl genome browser and Figure 3.2 for a zoomed‐in close‐up in which the translations in all six reading frames are made clear.
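Six‐frame translation itself is mechanical, and a short script makes the idea concrete. A minimal sketch (stop codons rendered as ‘*’; an ORF is taken here, as one common convention, to run from an ATG to the next stop codon):

```python
import re

# Standard genetic code packed as a 64-character string in TCAG order
# (first base varies slowest): TTT, TTC, TTA, TTG, TCT, ...
BASES = 'TCAG'
AA = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON = {a + b + c: AA[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

def revcomp(dna: str) -> str:
    """Reverse complement of a DNA string."""
    return dna.translate(str.maketrans('ACGT', 'TGCA'))[::-1]

def six_frames(dna: str):
    """Yield (frame_label, translation) for all six reading frames."""
    for strand, seq in (('+', dna), ('-', revcomp(dna))):
        for off in range(3):
            prot = ''.join(CODON[seq[i:i + 3]]
                           for i in range(off, len(seq) - 2, 3))
            yield f'{strand}{off + 1}', prot

def longest_orf(dna: str) -> str:
    """Longest Met-to-stop run over all six frames (stop codon required)."""
    orfs = [m.group()[:-1]          # drop the trailing '*'
            for _, prot in six_frames(dna)
            for m in re.finditer(r'M[^*]*\*', prot)]
    return max(orfs, key=len, default='')

print(longest_orf('ATGGCCATTGTAATGGGCCGCTGA'))  # 'MAIVMGR'
```

The ‘stop required’ rule is only one convention; real tools differ on whether ORFs may run off the end of a sequence or start at non‐ATG codons.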

Figure 3.1 An example of a genomic region visualised in the Ensembl genome browser. The blue bar across the middle of the view gives the contig ID and location. A contig is a fundamental unit of assembled sequence derived by aligning individual sequence reads. The specific assembly technique used depends on the underlying sequencing technology – typically Sanger sequencing or next generation sequencing (NGS). The coloured display immediately above and below the blue line indicates the colour coded sequence of the forward and reverse strands of the DNA.

Figure 3.2 A zoomed‐in section from Figure 3.1. When the zoom level changes, additional tracks become active that were not apparent in Figure 3.1, including the three possible translations on the forward strand shown above the contig bar and the three translations on the reverse strand below it. The colour coding of the DNA bases and the amino acids in the diagram can be a useful visual aid, or it can be turned off. Many different colouring schemes have been proposed and the selection of specific schemes can be helpful in certain analyses.

3.2.2 Gene Structure

Different genomes have different levels of complexity. The human genome is perhaps surprisingly not the most complex genome we know. It has a relatively low complement of genes (around 20 000 protein‐coding genes by current estimates) compared to some other organisms. It is the complex structure of human genes that makes for a combinatorial explosion of gene and protein products. The processes involved result from regulation of gene expression and splicing of exons and introns. The memory phrases ‘exons are expressed’ and ‘introns interrupt’ can be helpful. Introns are eventually spliced out by the splicing machinery of the cell (the spliceosome). Many human genes, for example, have multiple splice variants that are made up by joining together specific exons of one particular gene. However, the most complex example of splice variation known is the gene DSCAM in Drosophila melanogaster. With some 38 000 potential splice forms, this single gene has more splice variants than there are genes in the entire D. melanogaster genome. This can make analysis a complex process. Not all splice variants are known and many are tissue, or even cell specific – that is, they are only expressed in certain cells and at certain stages of the cell cycle (or upon cell–cell signalling or stresses). When conducting an analysis, we may refer to the canonical sequence. This is a single representative sequence chosen for the gene – typically the most commonly observed or best‐supported form of the transcript.
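The scale of that combinatorial explosion is simple arithmetic. DSCAM contains four clusters of mutually exclusive alternative exons; using the cluster sizes commonly cited in the literature (12, 48, 33 and 2 alternatives for exon clusters 4, 6, 9 and 17 respectively), the product recovers the figure quoted above:

```python
from math import prod

# Mutually exclusive alternative-exon clusters in D. melanogaster DSCAM
# (cluster sizes as commonly cited in the literature).
clusters = {'exon 4': 12, 'exon 6': 48, 'exon 9': 33, 'exon 17': 2}

isoforms = prod(clusters.values())
print(isoforms)  # 38016 potential splice forms
```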

3.2.3 Software Development Note

Bioinformatics scientists involved in programming applications for analysis should note that application programming interfaces (APIs) are available to assist in gathering the data required for conducting a bioinformatics analysis. These can be used in scripts and programs written in scripting languages such as Perl and Python. For example, in the Ensembl Core API, the canonical sequence refers to a commonly observed arrangement of exons in a particular transcript. Other transcripts may be observed, or might be predicted to occur based on evidence of splicing observed in sequencing projects. The API calls can be used to determine which transcript is most appropriate to use. Discussion with domain specialist scientists may also be necessary at this point in a project.
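The Perl Core API is one access route; the Ensembl REST service (rest.ensembl.org) offers a language‐neutral alternative. A sketch of the idea in Python, assuming the lookup endpoint returns transcript records carrying an is_canonical flag (check the current REST documentation before relying on these field names):

```python
import json
import urllib.request

SERVER = 'https://rest.ensembl.org'

def lookup(stable_id: str) -> dict:
    """Fetch a gene record, with its transcripts expanded, as a dict."""
    url = (f'{SERVER}/lookup/id/{stable_id}'
           '?content-type=application/json;expand=1')
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def canonical_transcript(gene_record: dict):
    """Return the stable ID of the transcript flagged as canonical, if any."""
    for t in gene_record.get('Transcript', []):
        if t.get('is_canonical'):
            return t['id']
    return None

# Example (requires network access):
# gene = lookup('ENSG00000012048')   # human BRCA1
# print(canonical_transcript(gene))
```

Keeping the network call out of the functions themselves makes them testable offline; and, as the text advises, the choice of transcript should still be sanity‐checked with domain specialists.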

3.3 Databases

In Section 3.1 we talked about databases without giving a formal definition. A database is an organised or structured collection of data held in a computer. The structured nature of the database means that it will be made up of entries consisting of the same overall form but where the detail differs between entries. A simple database format might consist of records with ‘internal_id’, ‘gene_symbol’, ‘description’ and ‘sequence’ attributes. Attributes are sometimes also known as ‘fields’ or ‘keywords’. There may be hundreds of entries using these keywords as field tags with a separator (perhaps a colon) followed by the data. Alternatively, the format can be specified separately and the data would then be expected to follow this format for each record. The format for a gene in the European Nucleotide Archive (ENA) is more complicated because it deals with complex data relationships and cross‐references and has evolved over time. The opening of an entry is shown in Figure 3.3 and the conclusion of the entry – including the cDNA sequence – in Figure 3.4. Note that, in molecular biology, cDNA is DNA synthesised from an mRNA transcript using an enzyme termed ‘reverse transcriptase’ (RT). In bioinformatics, cDNA is used to refer to an mRNA transcript's sequence, expressed as DNA bases (GCAT) rather than RNA bases (GCAU). In this database entry, observe that the sequence is in lower case letters and, even though the entry is mRNA, it uses the alphabet of DNA (‘agct’) rather than RNA (‘agcu’).

Figure 3.3 Extract of an entry from the European Nucleotide Archive (ENA) illustrating its structure. Here the individual records of the entry are tagged with two‐letter codes: ‘XX’ indicates a blank line and ‘ID’ is the first line and follows a specific format of its own with fields separated by semicolons.

Figure 3.4 The ending of the ENA entry from Figure 3.3. There are many feature (‘FT’) records, formatted to give detailed information on the origin of the sequence, the organism, and where the entry sits in the DNA→RNA→protein information flow. There are also several cross‐references to other databases (‘db_xref’ fields). The sequence itself is presented in a single tagged ‘SQ’ record. The entry ends with the ‘//’ separator. The ENA database files can be downloaded and all the entries are separated in this way. This format is called a flat file, as opposed to the format it takes when loaded into a relational database management system (RDBMS) such as that shown in Figure 3.5.

Sequence databases are distributed as flat files – human‐readable, text‐based files with keywords indicating the different fields within an entry – which are usually compressed. The compression saves space on file servers and reduces transfer time when using file transfer systems such as the file transfer protocol (FTP). This type of flat file can be used to populate a database management system (DBMS). The role of the DBMS is to ensure integrity of the data (the data only changes when we want it to) and efficiency of access to the data in the database from programs (such as are implemented by web‐based query systems and genome browsers). Logical components of DBMSs are indicated in Figure 3.5. See Figure 3.6 for an example of part of a relational database schema visualisation that indicates relationships between tables and attributes in a LIMS.
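Parsing these tagged flat files is a routine first exercise in bioinformatics scripting. A minimal sketch against a toy EMBL/ENA‐style entry (the entry is invented here for illustration; real entries carry many more record types and line‐numbered sequence blocks):

```python
TOY_ENTRY = """\
ID   XX000001; SV 1; linear; mRNA; STD; HUM; 24 BP.
XX
DE   toy entry invented for illustration
XX
SQ   Sequence 24 BP;
     atggccattg taatgggccg ctga
//
"""

def parse_entry(text: str):
    """Collect two-letter-tagged records into a dict; assemble the sequence."""
    fields, seq, in_seq = {}, [], False
    for line in text.splitlines():
        tag = line[:2]
        if tag == '//':           # end-of-entry separator
            break
        if tag == 'SQ':           # sequence block starts here
            in_seq = True
        elif in_seq:
            # keep only base letters (drops spacing and any base counts)
            seq.append(''.join(c for c in line.lower() if c in 'acgtun'))
        elif tag != 'XX':         # 'XX' lines are blank spacers
            fields.setdefault(tag, []).append(line[5:].rstrip())
    return fields, ''.join(seq)

fields, seq = parse_entry(TOY_ENTRY)
print(fields['DE'])   # ['toy entry invented for illustration']
print(len(seq))       # 24
```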

Figure 3.5 The main features of a database management system (DBMS) comprise the database server and associated administrative features that permit multiuser access to the system. Users access the system through a command interface or web‐based interface. Developers have a programmatic interface that can be used to develop scripts that can access the database server to store and retrieve data. Relational database management systems (RDBMSs) use SQL (structured query language) to form queries that are processed by the database client software and passed to the database server for optimisation and execution. Flat files, such as those illustrated in Figures 3.3 and 3.4, can be parsed by specially developed scripts and loaded into the database using a schema specifically designed to hold the data and express the relationships between attributes. SQL provides a flexible tool for querying such data.

Figure 3.6 A visualisation of part of the schema of a database (actually a laboratory information system – another example of a bioinformatics application). The rectangles represent tables in the database, which is an SQL database using PostgreSQL (www.postgresql.org) as the database engine. No data are shown here as this is the metadata (data that describes data) that defines the logical relationships between attributes of the data. The lines indicate relationships explicitly defined by foreign keys (attributes in one table that point to and depend upon attributes in another table). It is these relationships that enable complex processes to be modelled in the database and make for effective storage and querying of the data.
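The foreign‐key relationships such a schema visualises can be sketched with SQLite (the table and column names here are invented for illustration; production LIMS schemas are far richer):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('PRAGMA foreign_keys = ON')   # SQLite enforces FKs only on request

con.execute("""CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL)""")
con.execute("""CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
    name          TEXT)""")

con.execute("INSERT INTO gene VALUES (1, 'BRCA1')")
con.executemany("INSERT INTO transcript VALUES (?, ?, ?)",
                [(1, 1, 'BRCA1-201'), (2, 1, 'BRCA1-202')])

# The foreign key lets SQL join across the relationship...
rows = con.execute("""SELECT g.symbol, t.name
                      FROM transcript t JOIN gene g USING (gene_id)
                      ORDER BY t.transcript_id""").fetchall()
print(rows)   # [('BRCA1', 'BRCA1-201'), ('BRCA1', 'BRCA1-202')]

# ...and protects integrity: a transcript cannot point at a missing gene.
try:
    con.execute("INSERT INTO transcript VALUES (3, 99, 'orphan')")
except sqlite3.IntegrityError as exc:
    print('rejected:', exc)
```

It is exactly this combination of joinability and enforced integrity that the text credits with making complex processes modellable in a database.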

Database formats are specific to the type of data being stored. Use online resources to explore a number of different types including a protein sequence, DNA sequence, protein structure and protein families. This will help to gain an understanding of the breadth of data that is being managed in these databases.

3.3.1 Accessing and Using Data

Data are made available using the Internet and are typically accessed using various web browsers from many providers. Much of the data used in bioinformatics is available publicly and freely via file download sites. These are typically FTP servers, but as file sizes get ever larger alternative methods of distribution are sometimes preferred. Cloud‐based services from a number of vendors are available that provide a flexible compute capability alongside databases. Data can thus be analysed in one place without having to replicate it across multiple sites. There are, of course, costs associated with computing in the cloud as there are in bringing data in‐house and providing computational facilities to analyse it.

You should be familiar with at least one web‐based genome browser. The browsers at www.ensembl.org and genome.ucsc.edu are good places to start exploring and we use the Ensembl genome browser to illustrate several techniques in the following section.

3.4 Techniques

The first technique we will look at is the use of a genome browser and a gene information database to understand something about the human BRCA1 gene.

3.4.1 Genome Browsers

Here we will use the www.ensembl.org site, which provides a fully capable genome browser enabling users to explore the publicly available data for specific genomes. Often bioinformaticians will work on a single gene or family of genes, and scientific colleagues working on a project will often already know the specific gene symbol or accession code to look up. As a bioinformatician, you will want to become familiar with the background to the project, which suggests a wider search. Assuming that the project is researching breast cancer genes, we would probably first use a trusted web search engine to get a general impression of the work going on in the general area of ‘breast cancer genes’. This may seem obvious and unnecessary, but the effort that large companies put into the development of search engines can shortcut a lot of hard bioinformatics work, as new resources are coming online all the time.

Next, use a tool like the Open Targets Platform (which is freely accessible at www.targetvalidation.org) to perform a more targeted search resulting in the output shown in Figure 3.7.

Figure 3.7 A page from the Open Targets Platform (www.targetvalidation.org) showing the results of a search that started out as ‘breast cancer genes’ and led to ‘hereditary breast cancer’. The top two gene target symbols are BRCA2 and BRCA1. This gives a useful way into assessing the extent of information available and then the analyst can drill down further and explore additional resources.

The first thing to be aware of is that gene symbols change over time. What we currently call BRCA1 has synonyms in the scientific literature. These synonyms are listed in the Ensembl page for the human BRCA1 gene. The Ensembl page (a portion of which is shown in Figure 3.8 and a part of the scrollable interactive view in Figure 3.9) is a very functional and beautiful representation of a great many pieces of data brought together to provide a rich environment for genomic exploration. Note the number of transcripts and the link to splice variants. This is a complicated gene, which has been observed in 30 different isoforms. There are 97 versions of this gene in other organisms. These are known as orthologues – genes that have evolved from a common ancestral gene over time as populations also evolved to become distinct species (termed speciation).

Figure 3.8 The beginning of the Ensembl genome browser page for the human BRCA1 gene. Note the navigation aid at the left of the page giving access to more detail. Of particular interest are the number of splice variants observed for this gene. There are four synonyms in the literature. The Ensembl accession code for this entry is ENSG00000012048. Databases assign their own accession codes.

Figure 3.9 The section of the BRCA1 entry in Ensembl that displays an interactive scrollable and clickable browser for homing in on specific regions of the genome around the BRCA1 gene and its transcripts. The display in the lower section of the figure shows the structure of introns and exons (defined in the main text) that make up many of the BRCA1 splice variants. This gene is on the minus strand, which is why it is shown below the blue dividing bar. Plus strand genes would appear above the blue bar.

3.4.2 Sequence Comparison

We compare sequences (whether DNA, cDNA, RNA or protein) on a regular basis. We do this to discover common regions, understand overall structures, predict functionality of unknown regions from similarity with known ones, understand evolutionary relationships, etc. Sequence alignments using sequences of a similar length and showing a high degree of sequence similarity are easier to interpret. Given a sequence to investigate, we would run a database search (normally using one of the Basic Local Alignment Search Tool (BLAST) suite of programs, viz. www.ensembl.org/Homo_sapiens/Tools/Blast) to find related sequences. Comparing a shorter sequence (perhaps of a few hundred bases) with a much longer sequence – say a chromosome millions of bases long – is a different task and uses specially configured database search tools.

Whichever tools we use, the concepts of local and global alignment (see Section 3.2.3) become important. In genomic bioinformatics, we find ourselves comparing DNA with DNA much of the time. In other areas of bioinformatics, we may use protein/protein comparisons more. The basic rules for comparing sequences are the same in either case, but the alphabet for proteins is more extensive than for DNA. DNA comparison programs tend to be optimised for finding short stretches of identical sequence and then stringing these regions together. Protein level comparisons will normally use tables of similarity scores (‘scoring matrices’) to assess and score regions of similarity.
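The idea of a scoring matrix can be illustrated with a short Python sketch. The miniature matrix below contains just a handful of BLOSUM‐style values chosen for illustration; real comparisons use full published tables such as BLOSUM62 or PAM250.

```python
# A toy substitution matrix covering three amino acids (L, I, D).
# The values are illustrative, in the spirit of BLOSUM62, not a real table.
SCORES = {("L", "L"): 4, ("L", "I"): 2, ("L", "D"): -4,
          ("I", "I"): 4, ("I", "D"): -3, ("D", "D"): 6}

def pair_score(a, b):
    # substitution matrices are symmetric, so look up the pair either way round
    return SCORES.get((a, b), SCORES.get((b, a), 0))

def alignment_score(s1, s2):
    # sum the substitution scores across a gap-free alignment of equal length
    return sum(pair_score(a, b) for a, b in zip(s1, s2))

print(alignment_score("LID", "LLD"))  # 12: the conservative I/L substitution still scores well
```

The key point is that a mismatch between biochemically similar residues (here I and L) is penalised far less than one between dissimilar residues, which is what allows protein comparisons to detect more distant relationships than DNA comparisons.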

3.4.2.1 Similarity and Homology

When considering lists of hits and the relevance of results, it is important to take into consideration both the statistical significance (denoted by the E‐value in the BLAST output) and the biological context in which the result will be used. This is why multiple sequence alignment and some understanding of sequence similarity are important tools for the bioinformatician. It is easy to jump to conclusions about family relationships (homology) when parts of two sequences happen to share some sequence similarity. Language is important here.

3.4.2.2 The Dotplot

A useful visualisation tool is the dotplot. Here, two sequences (either DNA or protein) are laid out one on the X‐axis and the other on the Y‐axis of Cartesian space. We end up with a rectangle of cells that are marked if the XY positions have identical bases/residues. Thus we assess each cell in turn and visualise the result. More sophisticated algorithms introduce a sliding window, with a number of positions contributing to the score at any particular cell. A self/self sequence comparison is shown in Figure 3.10. Dotplots are very useful when reviewing sequence data at the genomic level following WGS as they clearly indicate structural variants (see Figure 3.11).
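The cell‐by‐cell assessment with a sliding window can be sketched in a few lines of Python. This is an illustrative sketch only (not the seqinr implementation mentioned below), using the same idea of requiring a minimum number of matches per window:

```python
# Minimal dotplot sketch: mark cell (i, j) when the window of bases starting
# at position i in seq1 and position j in seq2 contains at least
# `min_matches` identical bases (the window averaging described in the text).
def dotplot(seq1, seq2, window=5, min_matches=4):
    rows = len(seq1) - window + 1
    cols = len(seq2) - window + 1
    plot = [[' '] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            matches = sum(a == b
                          for a, b in zip(seq1[i:i+window], seq2[j:j+window]))
            if matches >= min_matches:
                plot[i][j] = '*'
    return plot

# Self/self comparison: the main diagonal is always filled, and the internal
# repeat in this sequence produces off-diagonal marks as well.
p = dotplot("GATTACAGATTACA", "GATTACAGATTACA")
print('\n'.join(''.join(row) for row in p))
```

Raising `min_matches` towards the window length suppresses noise at the cost of missing weaker similarities, which is the trade‐off being tuned in Figure 3.10.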

Image described by caption.

Figure 3.10 Dotplot of the DNA sequence of the human zinc finger transcription factor MAZ. This is a self/self comparison, so the dotplot is square, with the major diagonal (bottom left to top right) complete because each base is identical with its counterpart in the other sequence. Regions of similarity within the sequence are identifiable by clusters of dots forming defined dark areas off the major diagonal. The plot is symmetrical, the upper left triangle being identical to the lower right triangle. If two different sequences were being used, there would be no such symmetry. This plot was generated by the author using the R package ‘seqinr’, which provides the dotplot function and allows variation of the window length (set to 5) and the number of matches per window (4) required for a dot to show up in the plot. Window averaging is required for DNA sequences because their four‐letter alphabet gives an inherently poor signal‐to‐noise ratio. The accession code analysed is: ENST00000322945.10 MAZ‐202 cdna:protein_coding.

Image described by caption.

Figure 3.11 A matrix view (dotplot) of a structural variant from a 10x Genomics Chromium dataset in which both haplotypes have been sequenced. There is a deletion on one of the alleles – hence the large amount of white space off the diagonal. This technology gives longer effective reads than the standard NGS approach through the concept of linked‐reads (see text). The tools for analysing the data (Long Ranger) are provided by 10x Genomics (www.10xgenomics.com). This emphasises the power of relatively simple tools (the dotplot) in bioinformatics analysis and visualisation.

3.4.2.3 Local and Global Alignment

When the overall, global alignment quality between two sequences is important, the Needleman–Wunsch algorithm is used. It uses a dynamic programming technique to find its way through a matrix – much like the dotplot matrix. This technique tries to find the best alignment for all the residues in both sequences. Often, however, it is the local alignment of important features that we are most concerned with, in which case the Smith–Waterman technique is used. For database searching, it is necessary to optimise the implementation of the local alignment search in order to get results back to the user in a reasonable time. This requirement led to the development of BLAST, which remains the most popular means of searching a database with a query sequence (see the detailed discussion in [5]).
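The dynamic programming idea behind Needleman–Wunsch can be sketched compactly. The scoring scheme below (match +1, mismatch −1, gap −1) is illustrative, not that of any particular tool; the sketch returns only the optimal score, whereas a full implementation would also trace back through the matrix to recover the alignment itself:

```python
# Needleman-Wunsch global alignment score by dynamic programming.
# dp[i][j] holds the best score for aligning a[:i] with b[:j].
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):       # first column: a aligned against gaps
        dp[i][0] = dp[i-1][0] + gap
    for j in range(1, cols):       # first row: b aligned against gaps
        dp[0][j] = dp[0][j-1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # best of: align the two residues, or insert a gap in either sequence
            dp[i][j] = max(diag, dp[i-1][j] + gap, dp[i][j-1] + gap)
    return dp[-1][-1]

print(needleman_wunsch_score("GATTACA", "GATTACA"))  # identical sequences score 7
```

Smith–Waterman differs only in small but crucial details: cell scores are floored at zero and the best score anywhere in the matrix (not the final cell) is reported, which is what makes the alignment local rather than global.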

3.4.2.4 Multiple Sequence Alignment

Often we want to align several sequences that may be related to each other structurally or functionally. The multiple sequence alignment (using automated tools such as Clustal Omega [6]), often supported by some form of knowledge‐led editing in an interactive system (e.g. Cinema [7]), is an important technique. This approach relates more than two sequences to one another. Ultimately it leads to development of discriminators and profiles for seeking out and classifying sequences, see InterPro for an integrated resource [8]. Multiple sequence alignment is an important part of the process of phylogenetic analysis, which helps to chart the evolutionary history of sequences and their functions and conserved features. In DNA sequence analysis using short read techniques, multiple sequence alignment has its equivalent in the file of aligned reads (SAM/BAM – see Section 3.4.2.5).

3.4.2.5 Short Read Aligners – Variant Call Format (VCF)

Short sequences generated by DNA sequencing systems are known as short reads. When these short reads are generated, they are made available in a number of formats. A common format is fastq, which is often generated directly by sequencing systems. This is a basic sequence format in which each read occupies four lines: an identifier line, the sequence itself, a separator line and a quality line of the same length as the sequence, encoding the confidence of the base call at each position. The sequence is called from the raw instrument signal by base calling software; several base callers exist, and parameters can be set to determine how confident a call must be to gain a specific quality score. The variant call format (VCF) is a file format that summarises the variants (differences from a reference sequence) derived from aligned reads and enables them to be interchanged between different programs for downstream processing.

Sequence files generated by next generation sequencing (NGS) systems can be very large. Fastq is not a particularly efficient means of storing the called data. The Samtools project (samtools.sourceforge.net) provides many tools for compressing and converting between various DNA sequence formats. The SAM (sequence alignment/map) format is a generic format for storing large nucleotide sequence alignments [9].

3.4.2.6 Genome Aligners – BWA

The current raft of sequencing technologies are mainly based on sequencing many short sequences (∼100 to 200 or possibly 300 bases) of DNA known as short reads. New technologies focused on generating much longer read lengths of individual DNA molecules are in progress, but current technology makes it quick and relatively cheap to generate sequencing libraries and sequence gigabases of short read DNA. The drawback is that the large amount of sequence is broken up into these very short reads, which must then be compiled into longer stretches. Alignment is a time‐consuming process, especially when dealing with the billions of reads produced by the latest sequencing technologies. The Burrows–Wheeler transform (BWT) is an indexing scheme originally developed for compressing data very efficiently. When dealing with billions of bases of genomic sequence, compression becomes a useful paradigm in making analysis tractable, and aligners such as BWA (the Burrows–Wheeler Aligner) build a BWT‐based index of the reference genome to place reads rapidly. BigBWA [10] is an example of an implementation of BWA for genomic alignment from NGS data. It combines BWA with Hadoop – an implementation of the MapReduce programming model for distributing a big data problem across available resources. In the map phase the reads are split into subsets that are processed using the BWA technique; in the reduce phase the multiple output files are reduced into one unique solution.
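The transform itself can be illustrated in a few lines. This naive rotation‐sorting construction shows the idea only; production aligners build the index via suffix arrays and add auxiliary structures (the FM‐index) to support fast substring search:

```python
# Burrows-Wheeler transform by the naive method: sort all rotations of the
# text (terminated by a sentinel '$') and take the last column. The output
# tends to group equal characters into runs, which is what makes it
# compressible, and the transform is fully reversible.
def bwt(text):
    text = text + "$"  # sentinel, lexicographically smallest character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("BANANA"))  # ANNB$AA - note the runs of repeated characters
```

It is the combination of this reversibility with efficient indexed lookup that lets BWT‐based aligners search a whole‐genome index that fits in a few gigabytes of memory.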

Short read lengths restrict the resolution of complex regions containing a highly repetitive sequence, heterozygous sequences (sequences that exhibit differences on both alleles of a diploid species), structural variations and genome assembly. The human and mouse genomes were sequenced originally using the much longer reads generated by the Sanger sequencing method, which produced contig sizes of 50–80 Mb (megabases). Short read contigs are typically an order of magnitude smaller at between 10 and 100 kb (kilobases). A contig can be thought of as a block of aligned sequence. When sequence lengths are short, it is likely that there will be a gap between some of the reads that cannot be spanned by another read. At this point the contig is broken and a new one is started. The ordering of contigs is unknown without a longer read to align and join them up. NGS technologies, however, provide much greater depth of coverage than Sanger sequencing. This means that the quality of bases in a read and contributing to a contig can be assessed more accurately.

3.4.2.7 Long Read Technologies

In the previous section we came up against the issue of short read NGS contigs that cannot be linked together to form a longer contig. A longer read can be used to link such contigs. Longer contigs can also be achieved by generating longer reads in the first place. The single molecule real time (SMRT) system can produce reads of 10 000 bases (cf. 100–300 bases for short read NGS). Aligning such reads to form contigs can help with structural variation and haplotype calling (see www.mlo‐online.com/long‐read‐sequencing).

The alternative NGS method is to link together short reads based on molecular barcodes that indicate which long stretch of DNA a group of short reads originates from. The structural variation plot in Figure 3.11 shows the analysis based on a 10x Genomics Chromium WGS dataset.

3.5 Applications

3.5.1 CRISPR/Cas9 and Off‐Targets

Cas9 is an enzyme (a protein) that can induce a double‐stranded break in DNA. It was discovered in the Streptococcus pyogenes bacterium where it assists the bacterium to remember that it has seen an invading DNA virus before. It does this by incorporating the invading viral DNA into its own genome and by using sequence complementarity to recognise and cut the DNA of the virus should it infect the bacterium again. It was subsequently discovered that the Cas9 protein was responsible for this sequence‐specific recognition and that this could be reprogrammed to recognise desired genomic target sites. Importantly, it could also be transferred to work in mammalian cells to cut double‐stranded DNA at a defined location. This advanced the era of precision genomics significantly.

The repair of the double‐strand break by the cellular machinery is most frequently achieved by the non‐homologous end joining (NHEJ) process. This process is somewhat error prone and leads to short insertion or deletion events at the cut site. Less frequently, a mechanism known as homology directed repair (HDR) is used by the cell. This relies on copying the correct sequence from a template DNA (naturally the sister chromatid). This can be exploited to introduce defined changes and insertions of longer stretches of DNA into the genome. Experiments directed at specific genes use bioinformatics techniques to design primers and to prioritise CRISPR sites, and NGS genotyping can then be used to find out what the effect of the DNA repair has been.

The Cas9 protein scans the genome and stutters over pairs of G bases. The protein can be programmed to latch on to a specific genomic sequence of around 20 bases ending in ‘NGG’ (where ‘N’ is any base). The programming tool is a 20‐base guide RNA that is provided along with the Cas9 protein. There are about 300 million ‘NGG’ sites in the human genome. Furthermore, there can be off‐target cutting effects where there are one or more mismatches between the RNA guide and its target.
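A simple scan for candidate guide sites can be sketched in Python. This is an illustrative sketch only (it considers the plus strand only; a real tool such as WGE also scans the reverse complement, where the PAM appears as CCN):

```python
import re

# Find plus-strand SpCas9 candidate sites: a 20-base guide immediately
# 5' of an 'NGG' PAM. The lookahead regex catches overlapping GG pairs.
def find_guides(seq, guide_len=20):
    guides = []
    for m in re.finditer(r'(?=GG)', seq):
        pam_start = m.start() - 1          # position of the 'N' in NGG
        if pam_start >= guide_len:         # need room for the full guide
            guides.append((pam_start - guide_len,
                           seq[pam_start - guide_len:pam_start]))
    return guides

seq = "ACGT" * 6 + "TGG"   # 24 bases of sequence followed by a TGG PAM
print(find_guides(seq))    # one candidate guide ending just before the PAM
```

Enumerating candidate sites like this is the easy part; the hard part, addressed by tools such as WGE, is scoring each candidate against every near‐match elsewhere in the genome to estimate off‐target risk.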

Figure 3.12 illustrates the genome browser interface of the WGE Crispr design tool (www.sanger.ac.uk/htgt/wge). This bioinformatics tool assists in the selection of sites for CRISPR/Cas9 experiments. It supports many different features, including interactive searching for CRISPR/Cas9 sites in human and mouse genomes.

Image described by caption.

Figure 3.12 An example of a bioinformatics tool (a web application, or WebApp) that integrates a genome browser (wtsi‐web.github.io/Genoverse) along with many other tools to present useful information to the scientists designing CRISPR/Cas9 precision genome editing experiments. The guide sequences (here called ‘Crisprs’) are shown in green in the Crispr track. The PAM (protospacer adjacent motif) site (NGG or CCN when the sequence is reversed and complemented on the other strand) is coloured in blue. The PAM site is not part of the Crispr guide but is shown in blue in the Crispr track for convenience. Whether the Crispr is PAM‐right or PAM‐left tells us whether we are dealing with NGG or CCN PAM sites. It is necessary because WGE stores all its Crispr location data on the plus strand of the DNA duplex. The blue pop‐up window shows summary information for a Crispr object that has been clicked with the mouse. Several tracks show no data for this region.

The WGE implementation consists of three logical parts but users are normally only aware of the web application. This provides the context for searching and the visualisation of the search results. WGE also links to (i) a relational database (actually, PostgreSQL) that provides a persistent store of CRISPR and off‐target information and (ii) a CRISPR‐Analyser tool that runs on a separate server that provides fast indexed lookups to all the off‐targets in the human and mouse genomes with up to 4 mismatches to the 20‐base guide RNA query. The CRISPR‐Analyser provides its results principally in the form of an off‐target string. The off‐target string shown in Figure 3.12 is {0: 1, 1: 0, 2: 1, 3: 7, 4: 89}. This means that there is one sequence in the human genome that matches the CRISPR guide exactly (this is the query or on‐target sequence). There are no hits with just 1 mismatch in the guide sequence; 1 hit with 2 mismatches, 7 hits with 3 mismatches and 89 hits with 4 mismatches. The off‐target string is a helpful data structure that is used to help prioritise CRISPR guides for experimental design [11].
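A summary in this shape is straightforward to consume programmatically. The sketch below parses such a string into a mismatch‐count to hit‐count map; the exact serialisation format is an assumption based on the example quoted above:

```python
# Parse a WGE-style off-target summary, e.g. "{0: 1, 1: 0, 2: 1, 3: 7, 4: 89}",
# into a dict mapping number-of-mismatches -> number-of-genomic-hits.
def parse_off_targets(s):
    pairs = s.strip("{}").split(",")
    return {int(k): int(v) for k, v in (p.split(":") for p in pairs)}

summary = parse_off_targets("{0: 1, 1: 0, 2: 1, 3: 7, 4: 89}")
# total potential off-target sites, excluding the exact on-target match (key 0)
off_targets = sum(v for k, v in summary.items() if k > 0)
print(off_targets)  # 97
```

A guide with few close (1 to 2 mismatch) hits is generally preferred, since off‐target cutting becomes less likely as the number of mismatches grows.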

For the bioinformaticians, it is crucial to understand these basic mechanisms and to come up with computational solutions – either by programming new systems or by using tools written by others that can form part of a bioinformatics pipeline.

In researching, designing and implementing the WGE system many decisions had to be made about what exactly would constitute a ‘Crispr’ object in the system. Considerations of space for storage, rapidity of access to on‐target and off‐target data and summarisation of information all had to be carefully balanced to create a responsive system. Even the fine details of assigning IDs to Crisprs became critical (i.e. they cannot change once assigned) when a feed of Crispr locations was provided to the Ensembl genome browser.

The source code for WGE and the CRISPR‐Analyser is available on GitHub at github.com/htgt/WGE and github.com/htgt/CRISPR‐Analyser. The WGE code is written mainly in Perl and Javascript, and the CRISPR‐Analyser in C++.

3.5.2 Drug Discovery and Target Discovery

Bioinformatics has contributed significantly to drug discovery (discovery of new medicines) since the early 1990s when sequencing programs in pharmaceutical companies were introduced to trawl short‐read sequences, called expressed sequence tags (ESTs), for new variants of known genes. Drug targets are typically proteins that are amenable to having their function modulated by small molecule ligands. Such ligands are known as drugs (medicines). Now that we have more advanced sequencing capabilities, transcriptomics has become a major research endeavour. Projects underway are aimed at sequencing tissues (e.g. the GTEx consortium) and single cells (e.g. the Human Cell Atlas, HCA) from a wide variety of sources and mapping their transcripts to build up a profile of the genes that are expressed in individual cells across whole organisms.

The Human Cell Atlas (www.humancellatlas.org) is a major international collaboration in this area whose mission is ‘to create comprehensive reference maps of all human cells – the fundamental units of life – as a basis for both understanding human health and diagnosing, monitoring, and treating disease’.

The GTEx (Genotype‐Tissue Expression) consortium (www.gtexportal.org) catalogues genetic variation and the influence of that variation on gene expression within and between all major tissues in the human body. By building up a database and associated tissue bank of all major tissues from 1000 individuals, bioinformaticians have a rich resource of genetic variation, gene expression, histological and clinical data to mine.

Bioinformatics techniques are intimately involved in discovering new drugs and repurposing known drugs for new uses. Bioinformatics has been used to help understand the mechanism of drug action and through phylogenetics and the elucidation of evolutionary pathways has been able to clarify disputed molecular mechanisms [12]. Often highly divergent species are involved in the analysis and new approaches for improving phylogenetic methods are applied in the context of phylogenomics [13] – i.e. phylogenetics applied at the genomic scale.

3.5.3 Transcriptomics

When we are interested in looking at the expression of the complement of genes in a cell, we need to sequence the transcripts that have been transcribed from the genomic DNA into messenger RNA. In drug discovery, understanding the genes that are actually transcribed (and later expressed) is critical to discovery of new therapeutic targets. Generally, small molecule drugs will be targeted at specific proteins, rather than DNA or RNA. Discovering all the proteins expressed by a cell is done using biophysical techniques, including advanced mass spectrometry and the tools of proteomics. It can be effective simply to look at mRNA transcripts, a technique that also provides the ability to count transcripts and to see the difference in transcription levels between cells in healthy and diseased states, or at different stages in development.

The current RNA‐Seq technology that can be used to analyse RNA transcripts relies on transcribing the RNA transcripts back to DNA for sequencing, so it is in fact a DNA sequencing technique. This involves both copying (reverse transcription) and amplification (PCR) steps. From a bioinformatics perspective this creates problems because we have to assume that the reverse transcription and PCR processes have amplified the transcripts uniformly in generating the cDNA library, which may not always be the case. Transcript counting depends upon this assumption.

Newer native RNA‐Seq technologies (for example, Oxford Nanopore Technologies) remove the necessity for reverse transcription to cDNA and enable single longer molecules of RNA to be sequenced. This has the added advantage of sequencing through splice sites and making alternatively spliced transcripts easier to count [14]. This ability to derive reliable transcript counts is important because experiments often involve comparing gene expression between different conditions, such as drug treated compared to non‐drug treated.

Using these and associated genomics strategies, there is great potential to improve drug targeting and creation of diagnostic tools with increased sensitivity. The role of bioinformatics is central in optimising algorithms for alignment, quality metrics, standards for variant calling and interpretation as well as making the results available globally. This will ultimately foster a deeper understanding of disease and lead to greater therapeutic precision [15].

3.6 Concluding Remarks

Bioinformatics is a scientific discipline that requires understanding of molecular biology and computational information systems. Some facility with programming is valuable for bringing together tools into workflows or pipelines. However, it is also true that many bioinformaticians focus entirely on analysis and development of new approaches while delegating the coding aspects to those skilled in the art. There are many potential applications and we have focused on computational genomics here along with a small selection of tools that can help navigate around the extensive and highly interconnected world of genomic data. It is worth exploring further avenues of bioinformatics research whether in the journal Bioinformatics (for computational techniques and software), Nature Biotechnology (for technologies related to bioinformatics), or Nature and Science journals where important work on genomics is regularly published.

References

  1. International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409: 860–921.
  2. Ledford, H. (2016). AstraZeneca launches project to sequence 2 million genomes. Nature 532: 427.
  3. Zook, J., Chapman, B., Wang, J. et al. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32: 246–251.
  4. Livingstone, C.D. and Barton, G.J. (1993). Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput. Appl. Biosci. 9: 745–756.
  5. Attwood, T.K. and Parry‐Smith, D.J. (1998). Introduction to Bioinformatics. London: Pearson.
  6. Sievers, F., Wilm, A., Dineen, D.G. et al. (2011). Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7: 539.
  7. Parry‐Smith, D.J., Payne, A.W.R., Michie, A.D., and Attwood, T.K. (1997). CINEMA – a novel Colour INteractive Editor for Multiple Alignments. Gene 211 (2): GC45–GC56.
  8. Finn, R.D., Attwood, T.K., Babbitt, P.C. et al. (2016). InterPro in 2017 – beyond protein family and domain annotations. Nucleic Acids Res. 45 (D1): D190–D199.
  9. Li, H., Handsaker, B., Wysoker, A. et al. and 1000 Genome Project Data Processing Subgroup (2009). The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25: 2078–2079.
  10. Abuín, J.M., Pichel, J.C., Pena, T.F., and Amigo, J. (2015). BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies. Bioinformatics 31 (24): 4003–4005.
  11. Hodgkins, A., Farne, A., Perera, S. et al. (2015). WGE: a CRISPR database for genome engineering. Bioinformatics 31 (18): 3078–3080.
  12. Xia, X. (2017). Bioinformatics and drug discovery. Curr. Top. Med. Chem. 17 (15): 1709–1726.
  13. Parry‐Smith, D.J. (2003). Bioinformatics: its role in drug discovery. In: Burger's Medicinal Chemistry and Drug Discovery, 6e, vol. 1 Drug Discovery (ed. D. Abraham). Hoboken, New Jersey, USA: Wiley.
  14. Hussain, S. (2018). Native RNA‐sequencing throws its hat into the transcriptomics ring. Trends Biochem. Sci. 43 (4): 225–227.
  15. Ashley, E.A. (2016). Towards precision medicine. Nat. Rev. Genet. 17 (9): 507–522.

Further Reading

  1. Lodish, H. et al. (2013). Molecular Cell Biology. UK: Macmillan Higher Education.
  2. Parrington, J. (2015). The Deeper Genome. Oxford: OUP.
  3. Rodriguez‐Ezpeleta, N., Hakenberg, M., and Aransay, A.M. (eds.) (2012). Bioinformatics for High Throughput Sequencing. New York: Springer.
  4. Wang, X. (2016). Next Generation Sequencing Data Analysis. Florida: CRC Press.
  5. Watson, J. et al. (2014). Molecular Biology of the Gene Seventh Edition. New York: Cold Spring Harbour Press.

Websites

  1. www.ukbiobank.ac.uk/2018/04/whole‐genome‐sequencing‐will‐transform‐the‐research‐landscape‐for‐a‐wide‐range‐of‐diseases
  2. www.nature.com/news/human‐genome‐project‐twenty‐five‐years‐of‐big‐biology‐1.18436
  3. en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short‐Read_Sequence_Alignment
  4. www.ensembl.org
  5. www.ucsc.edu
  6. www.targetvalidation.org
  7. www.ensembl.org/Homo_sapiens/Tools/Blast
  8. samtools.sourceforge.net
  9. www.mlo‐online.com/long‐read‐sequencing
  10. www.10xgenomics.com
  11. www.sanger.ac.uk/htgt/wge
  12. www.humancellatlas.org
  13. www.gtexportal.org