CHAPTER 45
Estimating MicroRNA Expression Using the miRDeep2 Tool

GVPPSR Kumar, A Kumar and AP Sahoo

Animal Biotechnology Division, IVRI, UP, India

45.1 INTRODUCTION

Several software tools have been developed to detect miRNAs in NGS data. These include miRTRAP, DSAP, miRExpress, mirTools, miRDeep, miRNAkey, mireap, miRanalyzer, and Mirena, among others. Of these tools, miRDeep and mireap are considered the best for predicting novel miRNAs from mammalian data sets (Li et al., 2012). Here, we discuss miRDeep (Friedlander et al., 2008, 2012). The code and associated annotations are taken from guidance available online. In several cases, the explanations are reproduced verbatim from the source. The source URLs are cited in this chapter.

miRDeep identifies miRNAs within the large pool of transcripts sequenced in a deep sequencing run. It uses a probabilistic model of miRNA biogenesis to score how well the position and frequency of the sequenced RNAs fit the secondary structure of a candidate miRNA precursor. miRDeep2 is an overhauled version of the original miRDeep algorithm, with extensive new packages added. Its accuracy and sensitivity are estimated through internal statistical controls.

miRDeep2 can identify both canonical and non-canonical miRNAs in deep sequencing data, and can also profile miRNA expression across samples. The workflow comprises three steps: preprocessing of raw Illumina reads with the mapper.pl script; quantification and expression profiling with the quantifier.pl script; and miRNA identification with the miRDeep2.pl script.

45.2 PREPROCESSING OF READS

The mapper.pl script (the mapper module) processes the deep sequencing reads and/or maps them to the reference genome.

Flow diagram of the workflow: deep sequencing short reads, mapping to the reference, miRDeep2.pl (identifies novel and known miRNAs), and DESeq2 (differential expression).

FIGURE 45.1

The module can process or map data that are in FASTA format, and can also handle sequence space data. It has a number of functions that are implemented specifically for Illumina data. The examples in this chapter use the datasets available in the miRDeep2 tutorial (https://www.mdc-berlin.de/36105849/en/research/research_teams/systems_biology_of_gene_regulatory_elements/projects/miRDeep/documentation).

45.3 INPUT FORMATS OF THE DATA FILE

The default input file can be in FASTA, seq.txt or qseq.txt format. For more options, please refer to https://www.mdc-berlin.de/36105849/en/research/research_teams/systems_biology_of_gene_regulatory_elements/projects/miRDeep/documentation

45.4 OUTPUT FORMATS THAT CAN BE GENERATED

The output depends on the options used: a *.fasta file containing the processed reads, an *.arf file with the mapped reads, or both. In the example followed here, the user wishes to analyze deep sequencing data mapping to a ≈ 6 kb region on C. elegans chromosome II for known and novel miRNA genes (as per the miRDeep2 tutorial at the address given previously).

45.5 PRELIMINARY FILES USED IN THE EXAMPLE

These are as per the miRDeep2 tutorial:

  • cel_cluster.fa: a FASTA file with the reference genome.
  • mature_ref_this_species.fa: a FASTA file with the reference miRBase mature miRNA sequences for the species (C. elegans miRBase v.14 mature miRNAs) (http://petang.cgu.edu.tw/Bioinfomatics/Lecture/0_HTS/04/20120316.pdf).
  • mature_ref_other_species.fa: a FASTA file with the miRBase mature miRNA sequences for related species (C. briggsae and D. melanogaster miRBase v.14 mature miRNAs).
  • precursors_ref_this_species.fa: a FASTA file with the miRBase precursor miRNAs for the species (C. elegans miRBase v.14 precursor miRNAs).
  • reads.fa: a FASTA file with the deep sequencing reads.

45.5.1 Step 1: Building index with bowtie

./bowtie-build cel_cluster.fa cel_cluster

This command generates six files in the bowtie folder. Copy all the index files to the miRDeep2 folder.
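Before copying the index, it can be worth confirming that all six files are present. The following Python sketch is only an illustration: the helper name `missing_index_files` is hypothetical, and the list of suffixes assumes the standard small-index naming used by Bowtie 1 (the `.ebwt` files), which should be checked against the installed version.

```python
from pathlib import Path

# Assumed Bowtie 1 small-index suffixes; verify against your bowtie-build output.
EBWT_SUFFIXES = (
    ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
    ".rev.1.ebwt", ".rev.2.ebwt",
)


def missing_index_files(directory, prefix="cel_cluster"):
    """Return the expected index file names that are absent from `directory`."""
    folder = Path(directory)
    expected = [prefix + suffix for suffix in EBWT_SUFFIXES]
    return [name for name in expected if not (folder / name).is_file()]


# Check the current directory and report what, if anything, is missing.
missing = missing_index_files(".")
print("index complete" if not missing else f"missing: {missing}")
```

An empty list means that all six index files for the given prefix were found.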

45.5.2 Step 2: Process reads and map them to the genome

  • The -c option designates that the input file is a FASTA file.
  • The -j option removes the entries with non-canonical letters (characters other than a, c, g, t, u, n, A, C, G, T, U, N).
  • The -k option clips the adapters.

    FIGURE 45.2

    The mirdeep2 folder containing the six bowtie index files: cel_cluster.rev.1.ebwt, cel_cluster.rev.2.ebwt, cel_cluster.1.ebwt, etc.

    FIGURE 45.3

  • The -l option discards the reads that are shorter than the given length (here, 18 nt).
  • The -m option collapses the reads.
  • The -p option maps the processed reads against the previously indexed genome (cel_cluster).
  • The -s option names the output file of processed reads.
  • The -t option names the output file of genome mappings.
  • The -v option gives verbose output to the screen.

Go to the mirdeep2 directory and type the following command:

mapper.pl reads.fa -c -j -k TCGTATGCCGTCTTCTGCTTGT -l 18 -m -p cel_cluster -s reads_collapsed.fa -t reads_collapsed_vs_genome.arf -v


FIGURE 45.4

The collapsed reads are the reads that remain after the adapter sequence has been clipped, with identical sequences merged into a single entry. The mappings of the collapsed reads to the genome are given in the .arf file.
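As a rough illustration of what the preprocessing options ask mapper.pl to do, the Python sketch below filters, clips, length-filters, and collapses a handful of toy reads. It is not the mapper.pl implementation: the adapter matching is simplified to a search for the first six adapter bases, the toy reads are invented, and the `seq_<index>_x<count>` identifier style is an assumption about how collapsed reads are named.

```python
from collections import Counter

CANONICAL = set("acgtunACGTUN")  # letters kept by the -j filter


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def clip_adapter(seq, adapter, seed=6):
    """Simplified clipping: cut at the first match to the adapter's first `seed` bases."""
    pos = seq.find(adapter[:seed])
    return seq if pos == -1 else seq[:pos]


def preprocess(records, adapter, min_len=18):
    """Apply the -j, -k, -l and -m steps in turn; return sequence -> read count."""
    counts = Counter()
    for _, seq in records:
        if set(seq) - CANONICAL:          # -j: drop reads with other letters
            continue
        seq = clip_adapter(seq, adapter)  # -k: clip the adapter
        if len(seq) < min_len:            # -l: drop reads that are too short
            continue
        counts[seq] += 1                  # -m: collapse identical reads
    return counts


def to_collapsed_fasta(counts, prefix="seq"):
    """Format collapsed reads as FASTA lines (identifier style is an assumption)."""
    lines = []
    for index, (seq, n) in enumerate(counts.most_common()):
        lines.append(f">{prefix}_{index}_x{n}")
        lines.append(seq)
    return lines


ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"
toy_reads = """\
>toy_1
TGAGGTAGTAGGTTGTATAGTTTCGTATGCCG
>toy_2
TGAGGTAGTAGGTTGTATAGTTTCGTATGCCG
>toy_3
TGAGGTAGTAGGTTGTATAGXX
>toy_4
ACGTACGTACTCGTATGCCG
""".splitlines()

collapsed = preprocess(read_fasta(toy_reads), ADAPTER)
print("\n".join(to_collapsed_fasta(collapsed)))
```

In this toy input, the first two reads collapse into one 22 nt entry with a count of 2, the third is dropped for containing non-canonical letters, and the fourth is dropped because only 10 nt remain after clipping.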

Figure 45.5 shows how the reads in reads.fa were collapsed into reads_collapsed.fa. For example, the first read in reads_collapsed.fa is obtained by clipping the adapter sequence from sequence 4 (>nematiode_4) in the reads.fa file. The .arf (aligned reads) file shows where each read matches the genome exactly. For example, the first collapsed read matches the reference genome exactly at positions 3060–3081.
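To inspect the mappings programmatically, an .arf line can be split into named fields. The sketch below assumes the 13-column, tab-separated layout described in the miRDeep2 documentation (read identifier, read length, read start and end, read sequence, reference identifier, mapped length, reference start and end, reference sequence, strand, number of mismatches, edit string); the column order should be checked against your own file. The example line is invented: only the 3060–3081 coordinates come from the text above, and the reference name `toy_reference` is hypothetical.

```python
from typing import NamedTuple


class ArfRecord(NamedTuple):
    read_id: str
    read_len: int
    read_start: int
    read_end: int
    read_seq: str
    ref_id: str
    ref_len: int
    ref_start: int
    ref_end: int
    ref_seq: str
    strand: str
    mismatches: int
    edit_string: str


def parse_arf_line(line):
    """Split one tab-separated .arf line into an ArfRecord (assumed 13 columns)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 13:
        raise ValueError(f"expected 13 columns, found {len(f)}")
    return ArfRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]), f[9],
        f[10], int(f[11]), f[12],
    )


# Invented example line; identifiers and sequence are placeholders.
toy_seq = "tgaggtagtaggttgtatagtt"
toy_line = "\t".join([
    "seq_0_x2", "22", "1", "22", toy_seq,
    "toy_reference", "22", "3060", "3081", toy_seq,
    "+", "0", "m" * 22,
])

record = parse_arf_line(toy_line)
print(record.ref_id, record.ref_start, record.ref_end, record.mismatches)
```

A record with zero mismatches and a reference span equal to the read length corresponds to the exact match described for the first collapsed read.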


FIGURE 45.5

45.5.3 Step 3: Fast quantitation of reads mapping to known miRBase precursors

Reads mapping to known miRBase precursors are quantified using the quantifier.pl script. The quantifier module maps the deep sequencing reads to predefined miRNA precursors to determine the expression of the corresponding miRNAs. First, the predefined mature miRNA sequences are mapped to the predefined precursors; the deep sequencing reads are then mapped to the same precursors.

  • Input: a FASTA file with precursor sequences, a FASTA file with mature miRNA sequences, and a FASTA file with deep sequencing reads; optionally, a FASTA file with star sequences and the three-letter code of the species of interest.
  • Output: a tab-separated file called miRNAs_expressed_all_samples.csv with the miRNA identifiers and their read counts; a signature file called miRBase.mrd; a file called expression.html that gives an overview of all miRNAs in the input data; and a directory called pdfs that contains, for each miRNA, a .pdf file showing its signature and structure (see the miRDeep2 tutorial at the address given previously).
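The two-stage logic of the quantifier (mature sequences placed on precursors first, reads second) can be illustrated with a toy exact-match version in Python. This is a sketch under strong assumptions, not how quantifier.pl is implemented: it uses exact substring matching only, counts a read only if it lies entirely within the mature region, assumes all sequences use the same alphabet (T rather than U), and works on invented sequences and identifiers.

```python
def count_reads_per_mature(precursors, matures, collapsed_reads):
    """Toy quantification by exact matching.

    precursors:      {precursor_id: sequence}
    matures:         {mature_id: sequence}
    collapsed_reads: {read_sequence: read_count}
    Returns {(mature_id, precursor_id): summed read count}.
    """
    counts = {}
    for pre_id, pre_seq in precursors.items():
        for mat_id, mat_seq in matures.items():
            # Stage 1: place the mature miRNA on the precursor.
            m_start = pre_seq.find(mat_seq)
            if m_start == -1:
                continue
            m_end = m_start + len(mat_seq)
            # Stage 2: place each read on the precursor and keep those
            # that fall entirely inside the mature region.
            total = 0
            for read_seq, n in collapsed_reads.items():
                r_start = pre_seq.find(read_seq)
                if r_start == -1:
                    continue
                if m_start <= r_start and r_start + len(read_seq) <= m_end:
                    total += n
            counts[(mat_id, pre_id)] = total
    return counts


# Invented toy data.
mature = "TGAGGTAGTAGGTTGTATAGTT"
precursors = {"toy-pre": "GGGG" + mature + "CCCCAAAA"}
matures = {"toy-miR": mature}
reads = {
    mature: 5,                    # full-length mature read
    mature[:-1]: 2,               # read one base shorter at the 3' end
    "ACGTACGTACGTACGTACGT": 3,    # unrelated read, not on the precursor
}

print(count_reads_per_mature(precursors, matures, reads))
```

Here the two reads that sit within the mature region contribute 5 + 2 = 7 counts, and the unrelated read is ignored.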

The command is:

quantifier.pl -p precursors_ref_this_species.fa -m mature_ref_this_species.fa -r reads_collapsed.fa -t cel -y 16_19

The -p option designates the miRNA precursor sequences from the miRBase database, the -m option designates the mature miRNA sequences from miRBase, the -r option designates the collapsed deep sequencing reads, the -t option designates the species being analyzed, and the -y option sets the timestamp that is appended to the names of the output files (here, 16_19).


FIGURE 45.6

The following output is generated:

  • miRNAs_expressed_all_samples_16_19.csv: gives the read counts of the reference miRNAs in the data, in tabular format.
  • pdfs_16_19: contains the details of the miRNAs that were identified.
  • expression_16_19.html: presents all the results in html format. This file is present in the expression analyses folder in the mirdeep2 directory.
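Because the .csv file is tab-separated, it can be read with Python's standard csv module. The sketch below is hedged: the column names `#miRNA`, `read_count`, and `precursor` are assumptions about the header and should be compared with the first line of your own file, and the two data rows are invented placeholders rather than tutorial results.

```python
import csv
import io


def read_counts(handle, id_col="#miRNA", count_col="read_count", pre_col="precursor"):
    """Return {(miRNA, precursor): read count} from a tab-separated table."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[(row[id_col], row[pre_col])] = float(row[count_col])
    return counts


def expressed(counts, minimum=1.0):
    """List the (miRNA, precursor) pairs with at least `minimum` reads."""
    return sorted(key for key, value in counts.items() if value >= minimum)


# Invented placeholder table; with real data, open the .csv file instead, e.g.
# with open("miRNAs_expressed_all_samples_16_19.csv", newline="") as handle: ...
toy_table = io.StringIO(
    "#miRNA\tread_count\tprecursor\n"
    "toy-miR-1\t120\ttoy-mir-1\n"
    "toy-miR-2\t0\ttoy-mir-2\n"
)

counts = read_counts(toy_table)
print(expressed(counts))
```

Keying the counts on the miRNA and precursor pair avoids merging rows when one mature miRNA is listed against more than one precursor.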

Windows of .csv, .mrd, and reads.fa vs .mrd file.

FIGURE 45.7

45.5.4 Step 4: Identification of known and novel miRNAs in the deep sequencing data

Novel and known miRNAs are detected using the miRDeep2.pl script. The miRDeep2 module uses the output from the mapper module.

  • Input: a FASTA file with deep sequencing reads; a file of reads mapped to the genome in miRDeep2 arf format; a FASTA file of the corresponding genome; an optional FASTA file with known miRNAs of the species being analyzed; and an optional FASTA file of known miRNAs of related species.
  • Output: a spreadsheet and an html file with an overview of all detected miRNAs.

Go to the miRDeep2 directory and type the following command:

miRDeep2.pl reads_collapsed.fa cel_cluster.fa reads_collapsed_vs_genome.arf mature_ref_this_species.fa mature_ref_other_species.fa precursors_ref_this_species.fa -t C.elegans 2> report.log

The file "mature_ref_this_species.fa" contains all mature miRNAs of C. elegans, while the "mature_ref_other_species.fa" file contains all mature miRNAs of C. briggsae and D. melanogaster. The "2>" operator redirects all progress output, which is written to standard error, to the report.log file.
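When the pipeline is driven from a script rather than typed at the shell, the same redirection can be reproduced with Python's subprocess module. The sketch below is illustrative: the helper name `run_with_stderr_log` is hypothetical, the argument list simply mirrors the command above, and the demonstration runs a harmless Python one-liner instead of miRDeep2.pl so that it works without miRDeep2 installed.

```python
import subprocess
import sys

# Arguments mirroring the miRDeep2.pl command in the text.
MIRDEEP2_COMMAND = [
    "miRDeep2.pl",
    "reads_collapsed.fa",
    "cel_cluster.fa",
    "reads_collapsed_vs_genome.arf",
    "mature_ref_this_species.fa",
    "mature_ref_other_species.fa",
    "precursors_ref_this_species.fa",
    "-t", "C.elegans",
]


def run_with_stderr_log(command, log_path):
    """Run `command`, sending its standard error to `log_path` (like `2> log_path`)."""
    with open(log_path, "w") as log:
        completed = subprocess.run(command, stderr=log, check=False)
    return completed.returncode


# Stand-in command that writes a progress message to standard error.
demo_command = [sys.executable, "-c", "import sys; sys.stderr.write('progress')"]
exit_code = run_with_stderr_log(demo_command, "demo_report.log")
print("exit code:", exit_code)

# With miRDeep2 installed and the input files in place, one would instead call:
# run_with_stderr_log(MIRDEEP2_COMMAND, "report.log")
```

Passing the arguments as a list avoids shell quoting issues, and a non-zero return code signals that the log file should be inspected.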

The results.html generated after running the above command contains all the results generated from miRDeep2.pl. In addition, the command will also generate a directory with .pdfs showing the read signatures, structures, and score breakdowns of novel and known miRNAs in the data.

45.6 QUESTIONS

  1. Name two software programs that can predict novel miRNAs from RNA-Seq reads.
  2. True or false? The .arf file is similar to the .sam file.
  3. What is an .mrd file?