What to Look for in Your RNA Sequencing Data

This article focuses on conventional applications of RNA sequencing and explores mining the data for cSNPs, insertions and deletions, fusion genes, alternate splicing, novel genes and exons, eQTL, and more.

RNA sequencing is a treasure chest of information, and quite often we miss potentially groundbreaking findings in RNA-Seq datasets. We normally plan an RNA-Seq experiment to quantify expression in different samples, look for transcriptional changes, and see how they affect gene ontology, biological pathways and networks, but a huge amount of other information sits in the dataset waiting to be mined. This document covers the traditional applications of RNA-Seq and then digs deeper for the hidden gems.

Introduction to RNA-Sequencing

RNA sequencing (RNA-Seq) is a technology that sequences the transcriptome to sufficient depth to estimate the expression of individual transcripts and genes. It has significant advantages over microarrays in the accuracy of expression estimates and in eliminating the background noise that has always haunted the bioinformatics community. RNA-Seq also opens up analyses beyond gene expression and helps in better understanding the transcriptome.

Some of the well-known applications of RNA-Sequencing are listed below.

-Estimation of accurate expression

-Differential expression of genes and transcripts, gene ontology analysis, pathway and network analysis, co-regulation, and analysis of the regulome

-Identification of single nucleotide polymorphisms in the coding region, also known as cSNPs

-Insertions and deletions in the coding region

-Identification of fusion genes

-Identification of splicing events, alternate splicing and identification of novel splice sites

-Identification of novel exons and transcripts

-eQTL analysis

-Strand specific sequencing

-De-novo sequencing

-And many more

Below is the workflow for RNA Sequencing Data Analysis.

RNA Sequencing data analysis workflow


RNA Sequencing Quality Control

The sequencing machine outputs reads, which are then tested algorithmically for quality. FastQC is commonly used to assess the quality of RNA sequencing reads.

Quality Control for RNA-Sequencing Data

Two types of problems are often encountered.

-Some sequencing reads have low quality. The best approach is to remove them. If the number of such reads is high, re-sequencing should be considered.

-Quite often the quality drops toward the tail end of the read. The obvious solution is to trim the reads so that only high quality sequence is aligned. This ensures better alignment and more reliable results.
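The two fixes above can be sketched in a few lines of Python. This is only an illustration, not how any particular trimming tool works: it assumes Phred+33 encoded FASTQ quality strings, and the function names and the threshold of 20 are arbitrary choices made for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_tail(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until one reaches min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def passes_filter(quality_string, min_mean_quality=20, min_length=1):
    """Keep a read only if it is long enough and its mean quality is acceptable."""
    scores = phred_scores(quality_string)
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_quality


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_tail("ACGTACGT", "IIIIII##")
print(trimmed_seq, trimmed_qual)  # ACGTAC IIIIII
```

In practice a dedicated trimming tool would be used on the full FASTQ files; the sketch only shows the logic of "trim the tail, then discard what is still poor".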

RNA Sequence Alignment to the Reference Genome

Sequence alignment is one of the most critical steps in the whole analysis. Information will be lost if optimal parameters are not selected. The parameters depend on the read length, whether the reads are single or paired end, and the genome. Sequencing errors also play an important role when working out the optimal parameters. The right approach lies in understanding the tuning parameters and estimating their impact on the overall alignment. Alignment produces Binary Alignment/Map (BAM) files, which are sorted and indexed before being used in any downstream analysis. BAM files can be viewed in a genome browser to gain further insight into the transcriptome and/or to confirm findings.

RNA sequencing data can be aligned using a range of alignment tools. Some suit short reads, while others suit longer reads. Bowtie is an excellent aligner for short reads, and Bowtie2 is better suited to longer reads. TopHat is software that sits on top of Bowtie and handles reads that span splice sites. BWA is another widely used aligner, although it is not splice-aware, so reads that cross exon junctions need separate handling. Once the data is aligned, the BAM file can be viewed in a genome browser such as the Integrated Genome Browser.
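The sort and index step mentioned above is usually done with samtools. A minimal sketch follows, assuming samtools is installed and on the PATH; the helper names and the file names are hypothetical and only serve the example.

```python
import subprocess


def sort_and_index_commands(bam_path, sorted_path):
    """Build the samtools commands that sort a BAM file and index the result."""
    return [
        ["samtools", "sort", "-o", sorted_path, bam_path],
        ["samtools", "index", sorted_path],
    ]


def sort_and_index(bam_path, sorted_path):
    """Run the commands in order; raises CalledProcessError if samtools fails."""
    for command in sort_and_index_commands(bam_path, sorted_path):
        subprocess.run(command, check=True)


# Hypothetical file names, shown without running samtools.
for command in sort_and_index_commands("sample.bam", "sample.sorted.bam"):
    print(" ".join(command))
```

Calling `sort_and_index("sample.bam", "sample.sorted.bam")` would write the sorted BAM and its index next to it, ready to be loaded in a genome browser.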

Estimating the Expression from RNA Sequencing data

The aligned reads at each locus are used to estimate abundance, from which expression measures are calculated. Gene and transcript expression is reported as RPKM or FPKM. RPKM is reads per kilobase of transcript per million mapped reads, whereas FPKM is fragments per kilobase of transcript per million mapped reads, so that with paired-end data each read pair is counted once. RPKM/FPKM values are used in downstream analyses such as identification of differentially expressed genes and transcripts, gene ontology analysis, and pathway and network analysis. They can also be used to cluster genes or samples, or to develop a diagnostic model. Cufflinks is an excellent tool for estimating expression, and its companion tool Cuffdiff identifies differentially expressed genes.
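The RPKM definition above translates directly into a formula: reads, divided by transcript length in kilobases, divided by mapped reads in millions. A small sketch, with a function name and example numbers made up for illustration:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    Passing fragment counts instead of read counts gives FPKM.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Tools such as Cufflinks use statistical models to assign reads to transcripts before normalising, so their values will not match this simple count-based calculation exactly.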

Clustering to Find Co-Expressed Genes


Pathways and Networks from RNA Sequencing Study


Identification of Novel Genes and Transcripts from RNA Sequencing data

A stack of reads with no prior annotation can indicate a novel exon or transcript. If the stack of reads is detected in the coding region and has splice boundaries, it is highly likely to be a novel exon (see the image below). Advanced algorithms can be used to determine the authenticity and boundaries of the novel exon.

A stack of reads outside the known coding region can potentially be a novel gene previously undetected by conventional methods. Gene-related features such as start and stop codons and a TATA box, along with other properties, can be used to confidently qualify it as a gene.

Both novel exons and novel genes/transcripts can then be verified experimentally in the lab.
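The first step described above, finding stacks of reads that fall outside the existing annotation, can be sketched as follows. The sketch assumes per-base coverage for one chromosome is already available as a list, uses 0-based half-open intervals, and picks a depth threshold of 5 arbitrarily; the function names are hypothetical.

```python
def covered_intervals(coverage, min_depth=5):
    """Return half-open (start, end) runs where depth is at least min_depth."""
    intervals = []
    start = None
    for position, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = position
        elif depth < min_depth and start is not None:
            intervals.append((start, position))
            start = None
    if start is not None:
        intervals.append((start, len(coverage)))
    return intervals


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def unannotated_intervals(coverage, annotated_exons, min_depth=5):
    """Covered runs that do not overlap any annotated exon."""
    return [
        interval
        for interval in covered_intervals(coverage, min_depth)
        if not any(overlaps(interval, exon) for exon in annotated_exons)
    ]


coverage = [0, 0, 8, 9, 7, 0, 0, 6, 6, 6, 0]
print(unannotated_intervals(coverage, annotated_exons=[(2, 5)]))  # [(7, 10)]
```

Each reported interval is only a candidate. Checking for splice boundaries, start and stop codons and the other gene features mentioned above is a separate step.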


Novel exons, genes and alternate splicing from RNA sequencing data. The reads marked in red are an example of a novel exon. The reads marked in violet are an example of alternate splicing.

Identification of Alternate Splicing from RNA Sequencing data

Alternate splicing is known to play an important role in clinical conditions and is associated with phenotypic changes. Information on alternate splicing can be estimated from RNA-Seq data. Overlapping reads at a particular locus that are present in one set of samples and absent in another set are indicative of an alternate splicing event (see the image above). Once quantified, the information can be viewed in a genome browser and later confirmed using PCR.
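The "present in one set of samples, absent in the other" rule can be sketched as below. It assumes read counts per exon and per sample have already been computed, and that the gene itself is expressed in both groups (otherwise a missing exon just reflects a silent gene). The function name, thresholds and counts are made up for illustration.

```python
def group_specific_exons(counts_a, counts_b, min_reads=10, max_absent=0):
    """Flag exons covered in every sample of one group and absent in the other.

    counts_a and counts_b map exon id -> list of read counts, one per sample.
    """
    flagged = {}
    for exon_id in counts_a.keys() & counts_b.keys():
        a, b = counts_a[exon_id], counts_b[exon_id]
        present_a = all(count >= min_reads for count in a)
        present_b = all(count >= min_reads for count in b)
        absent_a = all(count <= max_absent for count in a)
        absent_b = all(count <= max_absent for count in b)
        if present_a and absent_b:
            flagged[exon_id] = "A only"
        elif present_b and absent_a:
            flagged[exon_id] = "B only"
    return flagged


group_a = {"exon1": [30, 25], "exon2": [40, 38], "exon3": [0, 0]}
group_b = {"exon1": [28, 31], "exon2": [0, 0], "exon3": [15, 22]}
print(group_specific_exons(group_a, group_b))
```

Here exon2 is flagged as present only in group A and exon3 only in group B, while exon1 is covered in both and is not reported.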

Identifying Coding SNPs from RNA Sequencing Data

The BAM files from RNA sequencing can also be used to mine coding SNPs. The information is obtained from overlapping reads after discounting sequencing errors. Accuracy may be questionable for transcripts with low expression, but with advanced algorithms accurate mining of cSNPs from RNA-Seq is possible. Mining coding SNPs brings more insight into the transcriptome and the effect of a variant on the protein. The SNPs can be analysed further to estimate the resulting changes in the protein and how deleterious a particular SNP is to its function.
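A very simple version of calling a SNP from overlapping reads is to count the bases observed at one position and require enough depth and enough non-reference reads. The sketch below does only that. The thresholds and the function name are assumptions for the example, and real callers also weigh base quality, mapping quality and strand bias.

```python
from collections import Counter


def call_snp(reference_base, observed_bases, min_depth=10, min_alt_fraction=0.2):
    """Return (alt_base, alt_count, depth) if a variant is supported, else None.

    observed_bases is the pile of read bases overlapping a single position.
    """
    depth = len(observed_bases)
    if depth < min_depth:
        return None  # too few reads, e.g. a low expressing transcript
    counts = Counter(base.upper() for base in observed_bases)
    counts.pop(reference_base.upper(), None)
    if not counts:
        return None  # every read agrees with the reference
    alt_base, alt_count = counts.most_common(1)[0]
    if alt_count / depth >= min_alt_fraction:
        return alt_base, alt_count, depth
    return None  # a few mismatches, more likely sequencing errors


print(call_snp("A", "AAAAAAGGGGGG"))  # ('G', 6, 12)
print(call_snp("A", "AAAAAAAAAAAG"))  # None
```

The depth check is where the accuracy concern for low expressing transcripts shows up: with only a handful of reads the function declines to make a call.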

Identifying Insertions and Deletions of the Genome using RNA-Sequencing Data

In diseases like cancer, there is a high likelihood of genomic insertions and deletions. Such insertions and deletions can have a significant impact on clinical outcomes, and they matter most when they occur in a coding region. RNA-Seq data has the potential to identify such genomic insertions and deletions. This information can lead to a better understanding of the disease and can be incorporated into clinical decision making.
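In aligned reads, insertions and deletions are recorded in the CIGAR string of each alignment (operations I and D), while introns in spliced alignments appear as N. The sketch below pulls the I and D events out of one CIGAR string. The function name is hypothetical, and the start position is taken to be the 1-based leftmost reference position of the alignment, as in the SAM format.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(start, cigar):
    """List (type, reference_position, length) for each I or D operation.

    An insertion is reported at the reference base that follows it,
    a deletion at its first deleted reference base.
    """
    events = []
    ref_pos = start
    for length_text, op in CIGAR_PATTERN.findall(cigar):
        length = int(length_text)
        if op == "I":
            events.append(("insertion", ref_pos, length))
        elif op == "D":
            events.append(("deletion", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return events


print(indels_from_cigar(100, "10M2I5M3D8M"))
# [('insertion', 110, 2), ('deletion', 115, 3)]
print(indels_from_cigar(1, "5M100N5M"))  # [] (a spliced read, not a deletion)
```

Events seen in many independent reads at the same position are the ones worth following up; a single read with an indel is more likely an alignment or sequencing artefact.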

Quantifying Fusion Genes from RNA-Sequencing data

Genomic fusions are well known in cancer and other clinical conditions. The BCR-ABL fusion, associated with the Philadelphia chromosome, is a well-studied example in cancer. RNA-Seq reads that do not align normally form the starting point for mining fusion genes. Paired-end sequencing is preferred for this purpose. Read pairs whose two mates do not align in proximity to each other are the basis for identifying fusion genes. Once a few such pairs are identified, the breakpoint is located by treating the reads as single-end reads. The locations of the pairs and the boundaries from the single reads together form the basis for quantifying all fusion genes. The results can be matched against databases of known fusion genes, viewed in a genome browser, and studied in the lab for functional changes.
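The first stage described above, collecting read pairs whose mates land in different genes, can be sketched as follows. It assumes each mate has already been aligned and assigned to a gene (or to None when it falls outside the annotation). The function name, the minimum of three supporting pairs and the example pairs are all made up for illustration.

```python
from collections import Counter


def fusion_candidates(read_pairs, min_pairs=3):
    """Count read pairs whose mates map to two different genes.

    read_pairs is an iterable of (gene_of_mate1, gene_of_mate2).
    Returns {(gene_a, gene_b): supporting_pairs} for well supported pairs.
    """
    support = Counter()
    for gene1, gene2 in read_pairs:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # unannotated mate or an ordinary pair within one gene
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}


pairs = (
    [("BCR", "ABL1")] * 4
    + [("ABL1", "BCR")]
    + [("TP53", "TP53")] * 10
    + [("GAPDH", "ACTB")]
)
print(fusion_candidates(pairs))  # {('ABL1', 'BCR'): 5}
```

Locating the exact breakpoint from reads that span the junction, as described above, would be the next step for each candidate pair.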

Identifying Fusion Genes from RNA Sequencing Data

Estimating eQTL from RNA-Sequencing data

With quantified expression and known cSNPs, eQTL studies can be performed on RNA-Seq data. The computational requirement is huge, and it can take weeks to associate SNP effects with expression changes. A large number of samples is normally needed to study such associations. Shortlisting the SNPs associated with deleterious effects can hugely reduce the computational requirement. Cloud and cluster computing can be used to speed up the analysis.
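At its simplest, one eQTL test is a regression of expression on genotype for a single SNP and gene pair. The sketch below fits that line with plain Python. Genotypes are coded as the number of alternate alleles (0, 1 or 2), and the function name and data are invented for the example. A real study repeats this for every SNP and gene pair and corrects for multiple testing, which is where the computational cost comes from.

```python
from math import sqrt


def eqtl_association(genotypes, expression):
    """Least-squares slope and Pearson correlation of expression on genotype."""
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matching lists with at least three samples")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx  # expression change per extra alternate allele
    correlation = sxy / sqrt(sxx * syy)
    return slope, correlation


genotypes = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 7.1, 6.9, 9.0, 9.2]
slope, correlation = eqtl_association(genotypes, expression)
print(round(slope, 3), round(correlation, 3))
```

Shortlisting SNPs before running the test, as suggested above, simply means calling this function for fewer SNP and gene pairs.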

De-Novo RNA Sequencing

RNA-Seq studies normally require a reference genome. However, RNA sequencing studies can also be performed on organisms without a reference genome. In such studies we create a partial reference from the RNA-Seq data itself, use it as the reference for alignment, and then quantify expression.

We help Companies and Researchers work with Next Generation Sequencing Data and develop automated pipelines which the user can run over Cloud. Contact us for more details at info@seqome.com
