Bioinformatics Workflow of Whole Exome Sequencing

Posted by beauty33 on November 23rd, 2018

The advent of next-generation sequencing (NGS), which produces millions to billions of sequence reads at high speed, has greatly accelerated genomics research. Currently available NGS platforms include Illumina, Ion Torrent/Life Technologies, 454/Roche, Pacific Biosciences, Nanopore, and GenapSys. They produce reads of 100-10,000 bp in length, enabling sufficient coverage of the genome at a lower cost. But faced with this enormous amount of sequence data, how do we best deal with it? And what are the most appropriate computational methods and analysis tools for this purpose? In this review, we focus on the bioinformatics pipeline of whole exome sequencing (WES).

Whole exome sequencing is a genomic technique for sequencing the exome (all protein-coding regions of the genome). It is widely used in basic and applied research, especially in the study of Mendelian diseases. To learn more about WES, you can read the article "Principle and workflow of whole exome sequencing". A typical WES analysis workflow includes the following steps: raw data quality control, preprocessing, sequence alignment, post-alignment processing, variant calling, variant annotation, and variant filtration and prioritization. Each is discussed below.

Figure 1. A general framework of WES data analysis (Bao et al. 2014).

Raw data quality control

Sequence data are commonly stored in two standard formats: FASTQ and FASTA. Unlike FASTA, FASTQ stores a Phred-scaled quality score for each base, which makes it possible to assess sequence quality. FASTQ is therefore widely accepted as the standard format for NGS raw data. Multiple tools have been developed to assess the quality of NGS raw data, such as FastQC, FastQ Screen, FASTX-Toolkit, and NGS QC Toolkit.
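To make the Phred scale concrete, here is a minimal sketch that reads four-line FASTQ records and converts each quality character to a score Q and an error probability P = 10^(-Q/10). It assumes the Phred+33 ASCII encoding (older files may use an offset of 64); the function names and the example record are made up for illustration.

```python
import io
from typing import Iterator, List, Tuple


def parse_fastq(handle) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical record: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
example = "@read1\nACGT\n+\nII5!\n"
for read_id, seq, qual in parse_fastq(io.StringIO(example)):
    scores = phred_scores(qual)
    print(read_id, seq, scores, [error_probability(q) for q in scores])
```

A score of Q20 therefore corresponds to a 1 in 100 chance of a wrong base call, and Q40 to 1 in 10,000.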

Read QC parameters:

1. Base quality score distribution

2. Sequence quality score distribution

3. Read length distribution

4. GC content distribution

5. Sequence duplication level

6. PCR amplification issues

7. K-mer bias

8. Over-represented sequences
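Several of these parameters are simple summaries over the reads. The sketch below computes four of them (read length distribution, GC content, mean base quality per position, and a crude duplication level) on a small in-memory list of reads. It is an illustration only, not how FastQC or the other tools compute their reports; the data and function names are hypothetical, and Phred+33 encoding is assumed.

```python
from collections import Counter
from typing import Dict, List, Tuple

Read = Tuple[str, str]  # (sequence, Phred+33 quality string)


def read_length_distribution(reads: List[Read]) -> Dict[int, int]:
    """Map each read length to the number of reads with that length."""
    return dict(Counter(len(seq) for seq, _ in reads))


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in one read."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def mean_quality_per_position(reads: List[Read], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position, over the reads that reach it."""
    totals: List[int] = []
    counts: List[int] = []
    for _, qual in reads:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def duplication_level(reads: List[Read]) -> float:
    """Fraction of reads whose exact sequence duplicates another read."""
    if not reads:
        return 0.0
    unique = len({seq for seq, _ in reads})
    return 1 - unique / len(reads)


# Hypothetical toy data set: two identical reads and one distinct read.
reads = [("ACGT", "IIII"), ("ACGT", "II55"), ("GGCCA", "IIIII")]
print(read_length_distribution(reads))
print([gc_content(seq) for seq, _ in reads])
print(mean_quality_per_position(reads))
print(duplication_level(reads))
```

A drop in mean quality toward the 3' end of the reads, or a high duplication level, is the kind of signal that motivates the preprocessing step.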

Data preprocessing

With a comprehensive read QC report (which generally covers the parameters above), researchers can determine whether data preprocessing is necessary. Preprocessing generally involves 3’ end adapter removal, filtering of low-quality or redundant reads, and trimming of undesired sequences. Several tools can be used for data preprocessing, such as Cutadapt and Trimmomatic. PRINSEQ and QC3 can perform both quality control and preprocessing.
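The three preprocessing operations can be sketched in a few lines. The code below removes a 3' adapter (either a full occurrence or a partial adapter prefix hanging over the read end), trims low-quality bases from the 3' end, and discards reads that end up too short. It uses exact string matching and a plain quality threshold, which is a simplification: tools such as Cutadapt and Trimmomatic tolerate mismatches and use more refined trimming strategies. The adapter string, thresholds, and function names are assumptions chosen for the example.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (sequence, Phred+33 quality string)


def trim_adapter(seq: str, qual: str, adapter: str, min_overlap: int = 3) -> Read:
    """Cut the read at a 3' adapter: a full match anywhere, or an adapter prefix at the read end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Try progressively shorter adapter prefixes against the end of the read.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Read:
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(reads: List[Read], adapter: str, min_q: int = 20, min_len: int = 4) -> List[Read]:
    """Adapter removal, then quality trimming, then a minimum-length filter."""
    kept = []
    for seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        seq, qual = trim_low_quality_tail(seq, qual, min_q)
        if len(seq) >= min_len:
            kept.append((seq, qual))
    return kept


ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence; substitute the one from your library kit
raw_reads = [
    ("ACGTACGTAGATCGGAAGAGC", "I" * 21),  # full adapter at the 3' end
    ("ACGTAC", "IIII##"),                 # two low-quality bases at the 3' end
    ("ACAGATC", "I" * 7),                 # partial adapter; too short after trimming
]
print(preprocess(raw_reads, ADAPTER))
```

The order matters in this sketch: adapters are removed first so that adapter bases with good quality scores do not survive the quality trim.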

Sequence alignment

Short-read mapping algorithms include the Burrows-Wheeler transform (BWT) and the Smith-Waterman (SW) algorithm. Bowtie2 and BWA are two popular short-read alignment tools that implement BWT-based indexing. MOSAIK, SHRiMP2, and Novoalign are important short-read alignment tools that implement the SW algorithm for increased accuracy. Additionally, multithreading and MPI implementations allow a significant reduction in runtime. Of the tools mentioned above, Bowtie2 stands out for its fast running time, high sensitivity, and high accuracy.
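To illustrate why the BWT is useful for read mapping, the sketch below builds the transform of a small reference from a naively sorted suffix array and then locates exact occurrences of a read with backward search (the core idea of an FM-index). It is a toy version under strong assumptions: exact matches only, an uncompressed index, and a tiny made-up reference. Bowtie2 and BWA add inexact matching, seeding, and memory-efficient data structures on top of this idea.

```python
from collections import Counter
from typing import Dict, List, Tuple

FMIndex = Tuple[List[int], str, Dict[str, int], Dict[str, List[int]]]


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes in sorted order (naive; fine for toy inputs)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: List[int]) -> str:
    """BWT: the character preceding each sorted suffix (wraps to the sentinel when i == 0)."""
    return "".join(text[i - 1] for i in sa)


def build_fm_index(reference: str) -> FMIndex:
    text = reference + "$"  # '$' sorts before A, C, G, T and marks the end
    sa = suffix_array(text)
    bwt = bwt_from_sa(text, sa)
    counts = Counter(text)
    # c_table[c]: number of characters in the text that are smaller than c
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, bwt, c_table, occ


def find_exact(pattern: str, index: FMIndex) -> List[int]:
    """Return sorted reference positions where pattern occurs exactly."""
    sa, bwt, c_table, occ = index
    lo, hi = 0, len(bwt)  # half-open range of suffix array rows
    for ch in reversed(pattern):  # backward search: extend the match to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


reference = "ACGTACGTTACG"  # hypothetical reference sequence
index = build_fm_index(reference)
print(find_exact("ACG", index))
print(find_exact("TTA", index))
```

Each step of the backward search costs the same regardless of reference length, which is why BWT-based aligners scale to whole genomes once the index is built.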

Post-alignment processing

After read mapping, the aligned reads are post-processed to remove undesired reads or alignments, such as reads exceeding a defined size and PCR duplicates. Tools such as Picard MarkDuplicates and SAMtools can distinguish PCR duplicates from true DNA material. The second step is to improve the quality of gapped alignments via indel realignment. Some aligners (such as Novoalign) and variant callers (such as GATK HaplotypeCaller) perform indel realignment themselves. After indel realignment, base quality score recalibration (BQSR, performed with BaseRecalibrator from the GATK suite) is recommended to improve the accuracy of base quality scores prior to variant calling.
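The idea behind duplicate marking can be sketched as follows: reads that map to the same chromosome, position, and strand are assumed to come from the same original fragment, and all but the best one in each group are flagged. This is a simplified, single-end illustration and not the implementation used by Picard or SAMtools, which also consider mate positions and other details. The Alignment record and its fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    """Hypothetical minimal alignment record for this sketch."""
    name: str
    chrom: str
    pos: int                 # 5' mapping position on the reference
    strand: str              # '+' or '-'
    base_quality_sum: int    # sum of the read's base quality scores
    is_duplicate: bool = False


def mark_duplicates(alignments: List[Alignment]) -> List[Alignment]:
    """Flag all but the highest-quality read at each (chrom, pos, strand) as duplicates."""
    groups = defaultdict(list)
    for aln in alignments:
        groups[(aln.chrom, aln.pos, aln.strand)].append(aln)
    for group in groups.values():
        best = max(group, key=lambda a: a.base_quality_sum)
        for aln in group:
            aln.is_duplicate = aln is not best
    return alignments


alns = [
    Alignment("r1", "chr1", 100, "+", 300),
    Alignment("r2", "chr1", 100, "+", 350),  # same start and strand as r1, better quality
    Alignment("r3", "chr1", 100, "-", 200),  # opposite strand, so a separate group
    Alignment("r4", "chr2", 100, "+", 100),
]
mark_duplicates(alns)
print([a.name for a in alns if a.is_duplicate])
```

Duplicates are flagged rather than deleted here, so downstream steps can decide whether to ignore them.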

Variant calling

Variant analysis is important for detecting different types of genomic variants, such as SNPs, SNVs, indels, CNVs, and larger SVs, especially in cancer studies, where it is vital to distinguish somatic from germline variants. Somatic variants are present only in somatic cells and are tissue-specific, while germline variants are inherited mutations present in the germ cells and are linked to the patient’s family history. In exome samples, variant calling is used to identify SNPs and short indels. The common variant calling tools are listed in Table 1. Some studies have evaluated these variant callers: Liu et al. recommended GATK, and Bao et al. recommended a combination of Novoalign and FreeBayes.

Table 1. The common variant calling tools.

Germline variant calling: GATK, SAMtools, FreeBayes, Atlas2

Somatic variant detection: GATK, SAMtools mpileup, Isaac Variant Caller, deepSNV, Strelka, MutationSeq, MuTect, QuadGT, Seurat, Shimmer, SolSNP, JointSNVMix, SomaticSniper, VarScan2, Virmid
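As a rough illustration of the difference between germline and somatic calls, the sketch below calls SNVs from per-position base pileups with a simple depth and allele-fraction threshold, then labels tumor variants as somatic when they are absent from the matched normal sample. This is a toy model: the callers listed above use probabilistic genotype models, base and mapping qualities, and in some cases local reassembly. The thresholds, data, and function names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Call = Tuple[str, str, float]  # (reference base, alternate base, alternate allele fraction)


def call_snvs(reference: str, pileup: Dict[int, List[str]],
              min_depth: int = 8, min_alt_fraction: float = 0.2) -> Dict[int, Call]:
    """Call an SNV where the most common non-reference base is frequent enough."""
    calls = {}
    for pos, bases in sorted(pileup.items()):
        depth = len(bases)
        if depth < min_depth:
            continue  # too little coverage to make a call
        ref_base = reference[pos]
        alt_counts = Counter(b for b in bases if b != ref_base)
        if not alt_counts:
            continue
        alt, n_alt = alt_counts.most_common(1)[0]
        fraction = n_alt / depth
        if fraction >= min_alt_fraction:
            calls[pos] = (ref_base, alt, fraction)
    return calls


def classify_somatic(tumor_calls: Dict[int, Call], normal_calls: Dict[int, Call]):
    """Split tumor calls into somatic (tumor only) and germline (also seen in normal)."""
    somatic = {p: c for p, c in tumor_calls.items() if p not in normal_calls}
    germline = {p: c for p, c in tumor_calls.items() if p in normal_calls}
    return somatic, germline


reference = "ACGTACGT"  # hypothetical reference, 0-based positions
normal_pileup = {2: ["G"] * 5 + ["A"] * 5, 5: ["C"] * 10}
tumor_pileup = {2: ["G"] * 5 + ["A"] * 5, 5: ["C"] * 7 + ["T"] * 3}

normal_calls = call_snvs(reference, normal_pileup)
tumor_calls = call_snvs(reference, tumor_pileup)
somatic, germline = classify_somatic(tumor_calls, normal_calls)
print("somatic:", somatic)
print("germline:", germline)
```

In this example the variant at position 2 appears at about 50% in both samples, as expected for a heterozygous germline variant, while the variant at position 5 appears only in the tumor.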

