Fundamentals of Next Generation Sequencing Data Analysis

This document describes the basic steps in Next Generation Sequencing data analysis. It covers quality checks and methods for working with low-quality data. We will also discuss aligning the data to the reference genome and the potential downstream analysis.

Next Generation Sequencing data analysis can be complex and time-consuming. Here we explain the analysis in a very simplified form. The data you normally receive from the core facility will be in FASTQ format, which is basically your sequences (also known as reads) along with the quality of each individual base call. The facility will likely also provide you with lots of other results, which they might have made you pay for. Those are the results of automated analysis, which means they have not been properly checked for quality, nor has the analysis been optimized for your study. Using the results of an automated analysis pipeline may also hide problems caused by low-quality reads.


Checking the Quality of Next Generation Sequencing Data

The sequencing machines generate images, the software performs base calling, and what we get are the FASTQ files. FASTQ files are the starting material for all the analysis. They can be opened on your PC as text files, provided you have a reasonably powerful computer. Below is an example of one of the entries in a FASTQ file. There will be millions of such entries in your FASTQ files.
@HWI-EAS283:48:64NLWAAXX:4:4:9203:1082 1:N:0:GATNAG
The first line is the identifier of your read. The second is the sequence, the third is a separator line beginning with "+", and the fourth is the quality score for each base called. Since it is impossible to go through each sequence and its quality by hand, there are programs that summarize the quality of the sequences.
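To make the four-line layout concrete, here is a minimal Python sketch that reads FASTQ entries and turns the quality line into numbers. It assumes the common Phred+33 encoding (each quality character's ASCII code minus 33); older data may use a different offset, so check with your core facility. The names read_fastq and phred_scores and the tiny record used in the demo are made up for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four-line FASTQ entry."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        sequence = handle.readline().strip()
        handle.readline()  # line 3: the '+' separator, ignored here
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ entry: {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to per-base scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


if __name__ == "__main__":
    # A made-up entry, not taken from real data.
    example = io.StringIO("@read1\nGATTACA\n+\nIIII##!\n")
    for record in read_fastq(example):
        print(record.identifier, phred_scores(record.quality))
```

With a real file you would pass open("your_file.fastq") instead of the in-memory example.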
FASTQC is one of the best programs for studying the quality of the data generated. Low-quality reads will lead to lower alignment rates and inaccuracies in your results. The software runs on Windows, Mac and Unix and is available for download.

Output of the FASTQC software. If the box plot reaches the red region, consider trimming the reads.

Working with FASTQC is straightforward. Select the FASTQ files and the software calculates the quality parameters. If nothing is flagged, that is good news: proceed to your alignment. If some checks are flagged, the problems need to be worked out first. Very low-quality data may be useless, and you might need to go back to the core facility and ask them to re-sequence the sample.
Two main types of problems can show up when data quality is low. The first is that a certain number of reads are of poor quality throughout. If they are a small percentage, they can be removed prior to aligning. A large number of such reads may indicate a potential sample preparation problem.
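As a rough illustration of removing bad reads, the sketch below keeps a read only if a minimum percentage of its bases reach a minimum quality score, and reports what fraction was dropped. The thresholds (score 20, 80% of bases), the Phred+33 offset and the function names are assumptions chosen for the example, not recommended settings.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def passes_filter(quality: str, min_score: int = 20,
                  min_percent: float = 80.0, offset: int = 33) -> bool:
    """True if at least min_percent of the bases have score >= min_score."""
    if not quality:
        return False
    good = sum(1 for ch in quality if ord(ch) - offset >= min_score)
    return 100.0 * good / len(quality) >= min_percent


def filter_reads(reads: Iterable[Read], min_score: int = 20,
                 min_percent: float = 80.0) -> Tuple[List[Read], float]:
    """Return the reads that pass and the fraction of reads that were dropped."""
    kept: List[Read] = []
    total = 0
    for read in reads:
        total += 1
        if passes_filter(read[2], min_score, min_percent):
            kept.append(read)
    dropped_fraction = (total - len(kept)) / total if total else 0.0
    return kept, dropped_fraction


if __name__ == "__main__":
    # Made-up reads: 'I' decodes to score 40 and '#' to score 2.
    reads = [("r1", "ACGT", "IIII"), ("r2", "ACGT", "I###"), ("r3", "ACGTA", "IIII#")]
    kept, dropped = filter_reads(reads)
    print([r[0] for r in kept], f"dropped {dropped:.0%}")
```

A small dropped fraction is what you hope to see. If most of the reads fail, that points back to the sample preparation problem mentioned above rather than to something filtering can fix.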
The other, more relevant and more common problem is low quality at the tail end of the sequence. This is mostly due to technical issues faced by the core facility. If it is too bad, you can ask them to sequence your sample again. If it is not that bad, the bases at the tail end can be removed prior to alignment. How much of the sequence needs to be removed is largely intuitive guesswork. One way around this is to chop the tail at several different lengths, perform the alignment for each, and see which length gives you the highest alignment percentage.
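The two ways of removing a bad tail can be sketched in a few lines of Python. The first function cuts bases off the end of the read until it meets one with an acceptable score. The second simply chops every read to a fixed length, which is what you would vary when trying several lengths. The threshold of 20, the Phred+33 offset and the function names are assumptions for illustration only.

```python
from typing import Tuple


def trim_low_quality_tail(sequence: str, quality: str,
                          min_score: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove bases from the tail end while their score is below min_score."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


def chop_to_length(sequence: str, quality: str, length: int) -> Tuple[str, str]:
    """Keep only the first `length` bases of a read."""
    return sequence[:length], quality[:length]


if __name__ == "__main__":
    # Made-up read: 'I' decodes to score 40 and '#' to score 2.
    seq, qual = "ACGTACGT", "IIII####"
    print(trim_low_quality_tail(seq, qual))
    # Candidate lengths to try before aligning each version separately.
    for length in (8, 6, 4):
        print(length, chop_to_length(seq, qual, length))
```

Note that the quality-based trim only touches the tail: a low-quality base in the middle of an otherwise good read is left in place.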

Prune Your Sequences to Retain Good Quality Reads

Now is the time to learn Linux. The rest of the analysis happens on a Linux machine. If you have a high-spec machine you can install Linux as a dual boot. You can also use the Amazon, Google or Microsoft cloud to run your analysis. We will discuss that in another blog post. For now, I assume that you have access to a Linux machine.
Transfer all your FASTQ files to the Linux machine. Install FASTX; instructions on how to install it are provided on its download page. Once it is installed, there are multiple things you can do with your data. FASTQ Quality Filter filters out sequences of low quality. FASTQ Quality Trimmer trims your sequences to retain only the good part of each sequence. It is quite simple once you are familiar with Linux commands and file systems.
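If you prefer to drive the two FASTX tools from a script, a sketch along these lines may help. It only builds the command lines for fastq_quality_filter and fastq_quality_trimmer and runs them on request. The file names and threshold values are placeholders, and the -Q33 option (commonly needed for Phred+33 data) is an assumption about your data and FASTX version, so check each tool's -h output before relying on it.

```python
import subprocess
from typing import List


def quality_filter_command(in_fastq: str, out_fastq: str,
                           min_score: int = 20, min_percent: int = 80) -> List[str]:
    """Drop reads unless min_percent of their bases have score >= min_score."""
    return ["fastq_quality_filter", "-Q33",
            "-q", str(min_score), "-p", str(min_percent),
            "-i", in_fastq, "-o", out_fastq]


def quality_trimmer_command(in_fastq: str, out_fastq: str,
                            min_score: int = 20, min_length: int = 30) -> List[str]:
    """Trim low-quality tails; discard reads shorter than min_length afterwards."""
    return ["fastq_quality_trimmer", "-Q33",
            "-t", str(min_score), "-l", str(min_length),
            "-i", in_fastq, "-o", out_fastq]


def run(command: List[str]) -> None:
    """Run one command and raise an error if the tool reports a failure."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder file names; printing only, nothing is executed here.
    print(" ".join(quality_filter_command("sample.fastq", "sample.filtered.fastq")))
    print(" ".join(quality_trimmer_command("sample.filtered.fastq", "sample.trimmed.fastq")))
```

Calling run() on each command in turn, once FASTX is installed and the input file exists, would filter first and then trim the surviving reads.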

Next Generation Sequencing Data Alignment and Improving the Alignment Percentage

Now that we have only high-quality sequences, it is time to align them to the reference genome. There are a number of aligners to choose from. Some are optimized for short reads while others are optimized for longer reads. For example, Bowtie is optimized for short reads, whereas Bowtie2 is optimized for longer reads. Another good aligner is BWA.
You will also need to download the reference genome files from iGENOMES. Once you have your good-quality sequences, the reference genome and an aligner installed, it is time to start aligning your data. Alignment can take from days to weeks depending upon the size of your data. It is always good to try different aligners and different parameters, and to evaluate them based on the alignment percentage. The higher the alignment percentage, the better the accuracy of downstream results will be. If you observe an unusually low alignment percentage, it may be due to sample contamination, a wrong choice of reference genome, or badly selected parameters. Normally you should be able to achieve an alignment rate of 80%. The output of the alignment is a BAM (Binary Alignment Map) file, which forms the basis for downstream analysis.
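Most aligners print the alignment percentage themselves, but it can help to see how such a number is derived. The sketch below works on SAM text, the plain-text counterpart of BAM (it assumes you have converted the file, for example with samtools view -h). It counts a record as aligned when the "unmapped" bit (0x4) of its FLAG field is not set, and skips secondary and supplementary records so that each read is counted once. The function name and the example records are made up for illustration.

```python
from typing import Iterable

UNMAPPED = 0x4          # SAM FLAG bit: segment unmapped
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def alignment_percentage(sam_lines: Iterable[str]) -> float:
    """Percentage of primary records in SAM text that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # extra records for a read that was already counted
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    if total == 0:
        raise ValueError("No alignment records found")
    return 100.0 * mapped / total


if __name__ == "__main__":
    # Made-up SAM records: three mapped reads, one unmapped, one secondary.
    sam = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r1\t256\tchr2\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
        "r4\t0\tchr1\t400\t60\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(f"{alignment_percentage(sam):.1f}% aligned")
```

For paired-end data each mate is its own record, so this counts mates rather than pairs. Comparing this percentage across trimming lengths, aligners or parameter sets is the evaluation described above.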
From this point on, the analysis becomes specific to what you plan to study. We will discuss that in separate blog posts.
There are some free apps developed to tease out the sequencing analytics. You can have a look at those apps.
