Bioinformatics Workflow for Whole Genome Sequencing

Posted by kiko on January 29th, 2023

Whole genome sequencing (WGS) has the capacity to greatly enhance genomic knowledge and understand mysteries of life by utilizing the most advanced genetic sequencing technologies. WGS can be used for variant calling, genome annotation, phylogenetic analysis, reference genome construction, and more. WGS tries to cover the whole genome, but actually covers 95% of the genome with technical difficulties in sequencing regions such as centromeres and telomeres. Another challenge for WGS is data management. As larger datasets become more accessible and affordable, computational analysis will be the rate-limiting factor rather than sequencing technology. Here we will discuss the bioinformatics workflow for detection of genetic variations in WGS to help you get through it.

The bioinformatics workflow for WGS is similar to that for whole exome sequencing. You can view our article Bioinformatics Workflow for Whole Exome Sequencing. The bioinformatics workflow for WGS falls into the following steps: (1) raw read quality control; (2) data preprocessing; (3) alignment; (4) variant calling; (5) genome assembly; (6) genome annotation; (7) other advanced analyses based on your research interest such as phylogenetic analysis.

Raw read QC and preprocessing

The raw files (fastq) need to be eliminated from poor-quality reads/sequences and technical sequences such as adapter sequences. This process is important for accurate and reliable variation detection. FastQC ( is a powerful tool for raw read QC that generates statistics data results, including basic statistics, sequence quality, quality scores, sequence content, GC content, sequence length distribution, overrepresented sequences, sequence duplication level plots, adapter content, and k-mer content. Tools like Fastx_trimmer and cutadapt can be used for read trimming.


A reference genome needs to be determined. Mash enables us to compare the sequencing reads generated against the reference set from NCBI RefSeq genomes ( to determine genetic distance and relatedness. The next step is to map the quality-controlled reads to the reference genome. Burrows-Wheeler Aligner (BWA) and Bowtie2 are two popular short read alignment algorithms. The output of BWA and Bowtie2 is the standard sequence alignment/map format known as SAM, which facilitates the following steps. Alternatively, BLAST ( is widely used for local alignment.

Variant calling

Once reads are aligned to the reference genome, variants can be identified by comparing the sample genome to the reference genome. Detected variants may be associated with disease, or simply be non-functional genomic noise. Variant call format (VCF) is the standard format for storing sequence variations, including SNPs (single nucleotide polymorphisms), indels, structural variants, and annotations. Variant calling can be complicated due to high rate of false positive and false negative identifications of SNVs and indels. The software packages in Table 2 are useful for improving variant calling.

Genome assembly

De novo assembly is the process to align overlapping reads to form longer contigs (larger contiguous sequences) and order the contigs into scaffolds (a framework of the sequenced genome). If there is a reference genome from a related species, the common method is to first generate contigs de novo and then align them to the reference genome for scaffold assembly. An alternative approach is the “Align-Layout-Consensus” algorithm. This method first aligns reads to a closely related reference genome, and then builds contigs and scaffolds de novo.

Users can assess the quality of draft genome assemblies or compare assemblies generated by different methods. There are a variety of metrics that reflect the quality of the assembly. Only contiguous near-complete (approximately 90%) assembly interrupted by small gaps will yield successful genome annotation.

(1)     Genome size. Both C-value and k-mer frequency-based approaches can infer genome size.

(2)     Assembly contiguity. N50 statistic can be used to evaluate assembly contiguity, which describes a kind of median of assembled sequence lengths.

(3)     Accuracy. Transcriptome data present an important resource for validating sequence accuracy and correcting scaffolds. Comparative genomic approaches can also provide guidance in detecting mis-assemblies and chimeric contigs.

Genome annotation

To fully understand the genome sequence, it needs to be annotated with biologically relevant information such as gene ontology (GO) terms, KEGG pathways, and epigenetic modifications. The annotation involves two phases:

(1) Computational phase. A computational phase includes repeat masking, prediction of coding sequence (CDS), and prediction of gene models.

a)       Repeat masking. Since repeats are poorly conserved across species, it is recommended to create a species-specific repeat library by utilizing tools such as RepeatModeler, RepeatExplorer.

b)       Prediction of CDS. Predict CDS using ab initio algorithms.

c)       Prediction of gene models. Protein alignment, syntenic protein lift-overs from other species, EST, and RNA-seq data can provide a valuable resource for predicting gene models.

(2) Annotation phase. All the evidence mentioned above (ab initio prediction, as well as protein-, EST-, and RNA-alignments) is then synthesized into a gene annotation. Additionally, automated annotation tools such as MAKER and PASA are available to integrate and weigh the evidence. WebApollo can be used to edit the annotation through the visual interface if anything is wrong with the gene annotations.

Once the genome annotation is assessed by visual inspection, you can publish the draft genome sequences and annotation. In order to allow others to improve the genome assembly and annotation, all raw data should be uploaded. The available databases for uploading genome include ENSEMBL and NCBI.

Like it? Share it!


About the Author

Joined: November 27th, 2018
Articles Posted: 75

More by this author