CHAPTER 39
Overview of Microarray Data Analysis

RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

39.1 CONCEPT

Microarray technology is used to study the expression of many genes simultaneously. Thousands of gene sequences (called probes) are placed on a glass slide (called a gene chip), and a sample containing DNA, RNA or protein (depending upon the type of microarray), called the “target”, is placed in contact with the gene chip. Hybridization occurs between targets and probes, and the labeled targets produce a light signal that is measured and quantified.

39.2 GETTING STARTED WITH MICROARRAY

The principle of microarray has been used in different areas of molecular biology (Source: Wikipedia), including: DNA microarrays (cDNA, oligonucleotide microarrays, SNP‐chip); MMChips (microRNA); Protein‐microarrays; Reverse Phase Protein Microarrays; Tissue‐microarrays; Chemical compound microarrays; Antibody microarrays; Carbohydrate arrays (glycoarrays), and so on.

Microarrays are manufactured as spotted arrays on glass, self‐assembled arrays or in situ synthesized arrays. There are two types of experimental procedure: the one‐color and the two‐color approach. In the one‐color approach, a single sample is labeled with one fluorophore (either Cy3 or Cy5) and hybridized to a single microarray chip. In the two‐color approach, two samples (usually sample and reference) are labeled with two different fluorophores (Cy3 and Cy5) and hybridized together on a single microarray chip. A one‐color array yields an absolute fluorescent signal, whereas a two‐color array yields ratios of fluorescent intensities. A one‐color array is simpler to perform, but the results of two‐color arrays are more robust, owing to the internal reference. However, because of the high cost, dye biases, high RNA input and the difficulty of finding an appropriate reference sample for two‐color microarrays, the single‐color array is often preferred.

39.3 MICROARRAY DATA ANALYSIS: GENE EXPRESSION ANALYSIS

39.3.1 Microarray experimental design

  1. Reference design: A pool of RNA samples serves as a common reference RNA (R), which is hybridized on every array together with one sample of interest. The logarithm of the ratio of the two label intensities at each spot is used as a measure of relative hybridization. The reference is labeled with the same dye on every array. A limitation of the reference design is that half of the hybridizations are used for the reference sample, which may be of no real interest.
  2. Loop design: This is an alternative to the reference design, in which samples are compared directly with one another rather than with a common reference. Two aliquots of each sample need to be arrayed in a loop design while performing a cluster analysis.

FIGURE 39.1 Reference design (a) and loop design (b) of a two‐color microarray. Different colors (red and green here) represent microarray chips. In order to avoid dye bias, the same samples are used twice, with opposing labeling schemes, for example, array 1: sample a (labeled with red dye) vs. sample b (labeled with green dye), and array 2: sample b (labeled with red dye) vs. sample a (labeled with green dye).
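
The dye‐swap scheme in the caption can be illustrated with a minimal Python sketch. It assumes that dye bias acts as a constant multiplicative factor on the red channel, which is a simplification, and all intensities and function names are hypothetical. Under that assumption, the bias adds the same amount to the log ratio of both arrays, so it cancels when the second log ratio is subtracted from the first.

```python
import math

def log_ratio(red, green):
    """Log2 ratio of red to green intensity for one spot."""
    return math.log2(red / green)

def dye_swap_average(array1, array2):
    """Combine the two arrays of a dye-swap pair for one gene.

    array1: (red, green) with sample a in red and sample b in green.
    array2: (red, green) with sample b in red and sample a in green.
    Returns an estimate of log2(a/b) in which a constant dye bias cancels.
    """
    m1 = log_ratio(*array1)  # log2(a/b) + dye bias
    m2 = log_ratio(*array2)  # log2(b/a) + dye bias
    return (m1 - m2) / 2

# Hypothetical spot: the true a/b ratio is 2, and the red dye reads 1.5x too bright.
array1 = (2000 * 1.5, 1000)  # sample a in red, sample b in green
array2 = (1000 * 1.5, 2000)  # sample b in red, sample a in green

estimate = dye_swap_average(array1, array2)
print(round(estimate, 6))  # log2 of the a/b ratio with the assumed bias removed
```

Either array on its own would misestimate the ratio because of the assumed red‐channel bias; only the combination of the two recovers it.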

39.3.2 Concepts of replicates

  1. Technical replicates: repeat hybridizations using the same RNA sample. These tell us about variation due to hybridization, imaging, etc.
  2. Biological replicates: repeat hybridizations using different RNA isolates (other animals/cells from the same group). Biological replicates indicate real variability in the sample.

39.4 STEPS INVOLVED IN MICROARRAY DATA ANALYSIS

  1. Image processing
  2. Background subtraction
  3. Normalization
  4. Identification of differentially expressed genes:
    • Which genes are expressed?
    • Which genes are differentially expressed?
  5. Cluster analysis (time series)
  6. Integration of differentially expressed genes with functional information: pathways

An overview of a microarray experiment is given in Figure 39.2.

Flow diagram of the application of microarray for gene expression analysis starting from a population of cells of interest leading to a parallelogram containing 4 rows of dots with image of a microarray chip at the right.

FIGURE 39.2 Application of microarray for gene expression analysis. Fluorescently labeled cDNA or cRNA is hybridized with probes, and the image is scanned through a scanner. Based upon the intensity of the signal, up regulated (red dots) and down regulated (green dots) genes are detected.

39.4.1 Image processing

After hybridization, microarray chips are scanned using a microarray scanner, and high‐quality images (TIFF, 16 bits/pixel) are obtained and then processed using suitable software. The content of the image is characterized by spot shape (morphology), spot intensity, background and noise level.

Two Cy3 vs. Cy5 scatterplots, for raw intensities (left) and log2 intensities (right), each showing dots along an ascending line. The dots in the right scatterplot are more concentrated near the line.

FIGURE 39.3 Data transformation converts the raw signal intensity of each probe‐target hybridization into a log scale. Transformation brings the distribution of the values closer to a normal distribution.
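
The transformation in Figure 39.3 can be sketched in a few lines of Python. The intensities below are hypothetical. Besides compressing the range of the data, the log2 scale makes fold changes symmetric: a twofold increase and a twofold decrease lie the same distance from zero.

```python
import math

def log2_transform(intensities):
    """Convert raw signal intensities to the log2 scale."""
    return [math.log2(x) for x in intensities]

# Hypothetical raw intensities, each twice the previous one
raw = [250, 500, 1000, 2000, 4000]
logged = log2_transform(raw)

# On the raw scale the ratios are 2.0 and 0.5; on the log2 scale they are +1 and -1.
up = math.log2(2000 / 1000)
down = math.log2(500 / 1000)
print(logged)
print(up, down)
```

After the transformation, each doubling of the raw intensity corresponds to an increase of one log2 unit.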

39.4.2 Normalization

  • First, microarray expression data are transformed to a log scale.
  • Normalization here does not mean making the data normally (Gaussian) distributed; normalization of microarray data refers to correcting the data for systematic technical variation before gene expression is compared.
  • Normalization takes care of efficiencies of incorporation of Cy3 and Cy5 and brings them to comparable levels.
  • Normalization also enables comparison of multiple microarray experiments.
  • The first step of data normalization is calculation of the background signal:
    • the lowest 5% of signals can be used as the background signal intensity;
    • signal of empty spots on array can be used as background signal.
  • Global normalization of raw signal intensity assumes that the average ratio of gene expression between the two channels is 1.
    • Example: If the mean green channel intensity of the samples is 10 000 units and that of the red channel is 5000 units, then the intensity of the red channel is multiplied by two, so that the ratio of the channel means is 1. If the data are log transformed, the corresponding mean log ratio is 0.
  • Another approach to global normalization uses housekeeping genes to correct for differences in the starting quantity of RNA and in labeling efficiency: divide each gene expression value by the mean expression value of all housekeeping genes.
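
The normalization steps listed above can be sketched in Python as follows. This is a minimal illustration, not a complete pipeline. The function names, the spot intensities and the gene identifiers (`gene1`, `hk1`, and so on) are hypothetical, and taking the mean of the lowest 5% of signals is one assumed way of turning "the lowest 5%" into a single background value.

```python
from math import log2
from statistics import mean

def background_from_lowest(signals, fraction=0.05):
    """Estimate background as the mean of the lowest `fraction` of signals."""
    ordered = sorted(signals)
    n = max(1, int(len(ordered) * fraction))  # use at least one spot
    return mean(ordered[:n])

def global_scale_factor(green, red):
    """Factor that rescales the red channel so that the channel means match."""
    return mean(green) / mean(red)

def housekeeping_normalize(expression, housekeeping_ids):
    """Divide every gene by the mean expression of the housekeeping genes."""
    reference = mean(expression[g] for g in housekeeping_ids)
    return {gene: value / reference for gene, value in expression.items()}

# Global normalization: green mean is 10 000 units, red mean is 5000 units.
green = [8000, 12000, 10000]
red = [4000, 5000, 6000]
factor = global_scale_factor(green, red)
red_scaled = [r * factor for r in red]
ratio_of_means = mean(red_scaled) / mean(green)
log_ratio_of_means = log2(ratio_of_means)

# Background estimate from 20 hypothetical spot signals (100, 200, ..., 2000).
signals = list(range(100, 2100, 100))
background = background_from_lowest(signals)

# Housekeeping normalization with two hypothetical housekeeping genes.
expression = {"gene1": 300.0, "gene2": 1200.0, "hk1": 500.0, "hk2": 700.0}
normalized = housekeeping_normalize(expression, ["hk1", "hk2"])

print(factor, ratio_of_means, log_ratio_of_means)
print(background)
print(normalized["gene1"], normalized["gene2"])
```

The scale factor reproduces the worked example in the list: the red channel is multiplied by two, after which the ratio of the channel means is 1 and its log is 0.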

39.4.3 Identification of differentially expressed genes

This involves several steps that limit false positive results. The steps and concepts are:

  1. Fold change: a threshold of more than twofold is most common.
  2. P‐value: the probability of observing a result at least as extreme as the one obtained, given that the null hypothesis is true.
  3. Type I error (α, the significance level against which the p‐value is compared): false positives.
  4. Multiple testing corrections, such as the Bonferroni correction, control the family‐wise error rate (Bonferroni correction: set α to the desired α/number of tests, e.g. 0.05/20 000 = 2.5 × 10⁻⁶).
  5. The above correction means that a critical value of 2.5 × 10⁻⁶ is applied to each of the genes, instead of p ≤ 0.05.
  6. However, this is a very small value, and it is hard to get genes to qualify for the test.
  7. False discovery rate (FDR, reported per gene as a q‐value): the expected proportion of false positives among the positive results.
  8. At FDR = 10%, we expect about 10% of the genes called differentially expressed to be false positives.
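
The thresholds above can be illustrated with a short Python sketch. The text does not say which FDR procedure is meant; the Benjamini‐Hochberg step‐up procedure is assumed here as one widely used choice. The p‐values, intensities and function names are hypothetical.

```python
import math

def passes_fold_change(treated, control, min_fold=2.0):
    """True if the change is more than `min_fold` in either direction."""
    return abs(math.log2(treated / control)) > math.log2(min_fold)

def bonferroni_threshold(alpha, n_tests):
    """Per-test critical value that controls the family-wise error rate."""
    return alpha / n_tests

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping values monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Bonferroni threshold for 20 000 genes at alpha = 0.05
genome_threshold = bonferroni_threshold(0.05, 20000)

# Five hypothetical genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(p_values)
n_bonferroni = sum(p <= bonferroni_threshold(0.05, len(p_values)) for p in p_values)
n_fdr = sum(q <= 0.10 for q in q_values)

print(genome_threshold)
print(q_values)
print(n_bonferroni, n_fdr)
print(passes_fold_change(3000, 1000), passes_fold_change(1200, 1000))
```

In this small example the FDR criterion at 10% retains more genes than the Bonferroni criterion, which reflects the point made in the list that the Bonferroni critical value is very strict.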

39.4.4 Cluster analysis

Cluster analysis shows how groups of genes vary together between the two conditions, and divides the genes or experimental samples into homogeneous groups:

  • Supervised clustering:
    • Support Vector Machines (SVM)
    • Artificial Neural Networks (ANN)
  • Unsupervised clustering:
    • hierarchical clustering
    • k‐means clustering
    • self‐organizing maps (SOM)
    • principal component analysis (PCA)
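
As an illustration of one of the unsupervised methods listed above, the following is a minimal k‐means sketch in plain Python. The expression profiles are hypothetical log ratios at four time points. Starting the centroids at the first k profiles is a simplifying assumption that keeps the example deterministic; practical implementations usually use random or more careful initialization.

```python
from statistics import mean

def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(profiles, k, n_iter=20):
    """Minimal k-means; centroids start at the first k profiles (assumption)."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels

# Hypothetical log ratios for six genes over four time points
profiles = [
    [0.1, 1.0, 2.0, 2.9],     # rising
    [0.0, -1.1, -2.0, -3.1],  # falling
    [0.2, 1.2, 1.9, 3.0],     # rising
    [-0.1, -0.9, -2.1, -2.9], # falling
    [0.0, 0.9, 2.1, 3.1],     # rising
    [0.1, -1.0, -1.9, -3.0],  # falling
]

labels = k_means(profiles, k=2)
print(labels)
```

Genes with rising profiles and genes with falling profiles end up in separate clusters, which is the kind of grouping that cluster analysis of a time series is meant to reveal.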

39.5 FUNCTIONAL INFORMATION USING GENE NETWORKS AND PATHWAYS

Finally, the investigator needs to understand the underlying systems biology. Differentially expressed genes are identified between the two groups (treatment and control) in order to explore the biological phenomenon. Genes interact with each other and form gene networks. Gene networks can be described at four hierarchical levels:

  1. Parts list – genes, transcription factors, promoters, binding sites.
  2. Control logics – interactions between different combinations of regulatory signals.
  3. Topology – a graph describing the connections between the parts.
  4. Dynamics.

The information that we need about a gene concerns its product, the place and time of its action, and its role in physiology. A bioinformatics initiative called the “Gene Ontology”, or simply “GO”, provides such information. GO describes gene products in three main domains, namely:

  1. Cellular component – the parts of a cell or its extracellular environment in which the gene product acts.
  2. Molecular function – the activities of a gene product at the molecular level.
  3. Biological process – sets of molecular events to which the particular gene product contributes.

39.6 LIVESTOCK RESEARCH THAT INVOLVED MICROARRAY ANALYSIS (SOME EXAMPLES)

  • Understanding the physiology of the bovine mammary gland (Hu et al., 2009; Moyes et al., 2010) and of mammary stem cells (Choudhary et al., 2013).
  • Gene expression profile of lactating and non‐lactating mammary gland (Suchyta et al., 2003).
  • Gene expression profile of Brahman steers (calf) muscle tissue, to understand remodeling of muscle tissue in response to nutritional stress (Byrne et al., 2005).

39.7 APPLICATIONS OF MICROARRAY

  • Gene expression analysis.
  • Genotyping.
  • Transcription factor binding analysis.
  • Treatment comparisons.
  • Detection of cancer vs normal cells.

39.8 QUESTIONS

  1. Explain the importance of replicates in a microarray experiment.
  2. What are the various steps involved in microarray data analysis?