Bioinformatics Analysis of Small RNA Sequencing

Posted by kiko on April 4th, 2023

Small RNAs are important functional molecules that fall into three main categories: microRNA (miRNA), small interfering RNA (siRNA), and piwi-interacting RNA (piRNA). They are shorter than 200 nt and are generally not translated into proteins. Small RNAs typically carry out RNA interference (RNAi) by forming the core of an RNA-protein complex, the RNA-induced silencing complex (RISC). Small RNA sequencing, an example of targeted sequencing, is a powerful method for profiling small RNA species and for functional genomic analysis. Here, we present guidelines for the bioinformatics analysis of small RNA sequencing data.

Raw data pre-processing and quality control

Raw reads must be trimmed to remove adapter artifacts, and reads that are too short to align reliably must be discarded. Reads shorter than 16–18 nt, which typically represent degraded RNA or adapter dimers, need to be removed. Tools such as Btrim, FASTX-Toolkit, FaQCs, and Cutadapt are used for this purpose. However, adapter and length trimming alone is not enough for high-quality datasets and accurate alignments. Algorithms such as Quake and ALLPATHS-LG are dedicated to correcting unreliable base calls by superimposing the most frequent and similar patterns on them. Low-quality reads also need to be trimmed or removed entirely based on their Phred scores. Popular quality-trimming tools include Cutadapt, Btrim, FASTX-Toolkit, FaQCs, and SolexaQA.
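The trimming steps described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Cutadapt or any of the listed tools is implemented: the adapter sequence, the 18 nt length cutoff, the Phred cutoff of 20, and the function names are all assumptions, and adapter matching is exact (real tools tolerate mismatches).

```python
# Minimal sketch: 3' adapter clipping, 3' quality trimming, and length filtering.
# ADAPTER is an assumed placeholder; use the sequence supplied with your library kit.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"
MIN_LEN = 18        # assumed minimum read length after trimming
MIN_QUAL = 20       # assumed Phred cutoff
PHRED_OFFSET = 33   # Sanger / Illumina 1.8+ quality encoding


def clip_adapter(seq, qual, adapter=ADAPTER, min_overlap=8):
    """Cut the read at the adapter (exact match), including a partial adapter at the 3' end."""
    pos = seq.find(adapter)
    if pos == -1:
        # Look for an adapter prefix of at least min_overlap bases at the read end.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq, qual, min_qual=MIN_QUAL):
    """Remove bases from the 3' end while their Phred score is below min_qual."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_qual:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(records, min_len=MIN_LEN):
    """Yield (name, seq, qual) for reads that survive trimming and the length filter."""
    for name, seq, qual in records:
        seq, qual = clip_adapter(seq, qual)
        seq, qual = trim_low_quality_tail(seq, qual)
        if len(seq) >= min_len:
            yield name, seq, qual
```

A read consisting of a few bases followed by the full adapter is reduced to those few bases and then dropped by the length filter, which is how adapter dimers disappear from the dataset.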

After data pre-processing and quality control, the remaining reads should be free of low-quality sequences (Phred quality score below 20) and adapter artifacts, and the read-length distribution should show a distinct peak that matches the small RNA species of interest (e.g., 21–23 nt for miRNA and 30–32 nt for piRNA).
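One quick way to check for the expected length peak is to tabulate read lengths after trimming. The sketch below assumes the trimmed reads are available as plain strings; the function names are hypothetical.

```python
from collections import Counter


def length_distribution(seqs):
    """Count how many reads have each length."""
    return Counter(len(s) for s in seqs)


def modal_length(seqs):
    """Return the most frequent read length (the position of the peak)."""
    return length_distribution(seqs).most_common(1)[0][0]


def fraction_in_range(seqs, lo, hi):
    """Fraction of reads whose length falls within [lo, hi], inclusive."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(lo <= len(s) <= hi for s in seqs) / len(seqs)
```

For a miRNA library, a modal length outside 21–23 nt, or a low fraction of reads in that window, suggests that trimming or size selection should be revisited.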

Small RNA read alignment

Read alignment strategies involve mapping reads to a reference genome or to dedicated small RNA databases such as miRBase and Rfam. In addition to comparisons with these specific sequences, homologous datasets from well-studied organisms are also useful, because seed sequences are strongly conserved across species for most small RNAs.
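To illustrate the idea of database matching and seed conservation, the sketch below assigns a read to a reference entry by exact match and falls back to a shared seed (nucleotides 2–8). It is not a substitute for a real aligner, which tolerates mismatches and handles multi-mapping reads; the reference names and the function names are hypothetical, and in practice the reference would be loaded from a database FASTA file.

```python
def to_dna(seq):
    """Normalize case and convert RNA letters to DNA so reads and references compare directly."""
    return seq.upper().replace("U", "T")


def seed(seq):
    """Seed region: nucleotides 2-8 of the small RNA (0-based slice 1:8)."""
    return seq[1:8]


def build_indexes(reference):
    """Build an exact-sequence index and a seed index from {name: sequence}."""
    exact = {}
    by_seed = {}
    for name, s in reference.items():
        dna = to_dna(s)
        exact[dna] = name
        by_seed.setdefault(seed(dna), []).append(name)
    return exact, by_seed


def annotate(read, exact, by_seed):
    """Return ('exact' | 'seed' | 'unassigned', [matching reference names])."""
    read = to_dna(read)
    if read in exact:
        return "exact", [exact[read]]
    family = by_seed.get(seed(read))
    if family:
        return "seed", sorted(family)
    return "unassigned", []
```

A read that shares only the seed with a reference entry is a candidate homolog or family member rather than a confirmed match, so seed-level hits are best reported separately from exact ones.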


Normalization

Systematic variation needs to be addressed before differential expression analysis. This process, called normalization, corrects for undesired differences between libraries in sequencing depth, GC content, and batch effects. Scaling each library by the median of its expression ratios relative to per-feature geometric means has been found to work well with diverse kinds of datasets. Zyprych-Walczak et al. (2015) proposed a workflow to determine the most suitable normalization method for a specific dataset.
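The median-of-ratios idea mentioned above can be sketched as follows. This is a simplified illustration that assumes raw counts are stored as a dictionary of per-library lists and that at least one feature has non-zero counts in every library; the function names are hypothetical, and established packages should be preferred for real analyses.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: {feature: [raw count per library]} -> one scaling factor per library."""
    n_libs = len(next(iter(counts.values())))
    # Log geometric mean per feature, using only features detected in every library.
    log_geo_means = {
        feat: sum(math.log(c) for c in row) / n_libs
        for feat, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_libs):
        log_ratios = [math.log(counts[f][j]) - lg for f, lg in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its library."""
    sf = size_factors(counts)
    return {f: [c / s for c, s in zip(row, sf)] for f, row in counts.items()}
```

If one library was simply sequenced twice as deeply as another, its size factor comes out twice as large, and the normalized counts of unchanged features become equal across the two libraries.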

Differential expression analysis

Differential gene expression (DGE) analysis is a key part of small RNA data analysis, as it helps predict targets and find biomarkers. There are several good tools for this purpose (Table 3), but the optimal tool is highly dependent on the specific dataset.
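As a rough first look before running a dedicated tool, log2 fold changes can be computed from normalized counts. The sketch below only computes an effect size with an assumed pseudocount of 1; it performs no statistical test and does not replace the tools referred to above. The function name and the input layout are hypothetical.

```python
import math
from statistics import mean


def log2_fold_changes(norm_counts, control_idx, treated_idx, pseudocount=1.0):
    """Return {feature: log2(mean treated / mean control)} with a pseudocount to avoid log(0)."""
    result = {}
    for feat, row in norm_counts.items():
        control = mean(row[i] for i in control_idx)
        treated = mean(row[i] for i in treated_idx)
        result[feat] = math.log2((treated + pseudocount) / (control + pseudocount))
    return result
```

Fold changes alone say nothing about significance, so features flagged this way still need a proper test that models count dispersion across replicates.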

Biomarker identification and target prediction

Biomarker candidates can be identified by differential expression analysis. The tools shown in Table 1 can also be used for biomarker identification. Most of the small RNA biomarkers detected so far are miRNAs. There are several tools and software packages for the in silico functional analysis of miRNA. TargetScan, TargetFinder, and miRanda can be used for in silico target prediction. The predicted target genes are then analyzed further by gene ontology (GO) and KEGG pathway analysis.
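The core of seed-based target prediction can be sketched as a scan of a 3' UTR for sites complementary to miRNA nucleotides 2–8. This is a deliberately minimal illustration: dedicated prediction tools also weigh factors such as site context and conservation, and the function names and inputs here are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string (characters other than ACGU pass through unchanged)."""
    return rna.translate(_COMPLEMENT)[::-1]


def find_seed_sites(mirna, utr):
    """Return 0-based start positions in utr that are complementary to miRNA nucleotides 2-8."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    site = reverse_complement(mirna[1:8])
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)  # step by 1 so overlapping sites are found
    return hits
```

Genes whose UTRs contain one or more such sites form a candidate list that can then be passed on to GO and KEGG enrichment analysis.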


To confirm the small RNA sequencing results, differentially expressed small RNAs need to be examined by qRT-PCR. If the qRT-PCR results are consistent with the sequencing results, the small RNA sequencing data can be considered credible and reliable. The discovered biomarker signature can therefore be accepted only after this validation.
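Consistency between the two methods can be checked by comparing the direction of change for each validated small RNA. The sketch below assumes qRT-PCR results are summarized as Ct values for the target and a reference gene, uses the common delta-delta-Ct approach (which assumes near-100% amplification efficiency), and uses hypothetical function names; the original workflow does not specify how the comparison is made.

```python
def qpcr_log2fc(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """log2 fold change from Ct values: fold change = 2 ** (-ddCt), so log2 fold change = -ddCt."""
    ddct = (ct_target_treated - ct_ref_treated) - (ct_target_control - ct_ref_control)
    return -ddct


def direction_concordance(seq_log2fc, pcr_log2fc):
    """Fraction of shared small RNAs whose change has the same direction in both methods."""
    shared = [name for name in seq_log2fc if name in pcr_log2fc]
    if not shared:
        return 0.0
    agree = sum((seq_log2fc[n] > 0) == (pcr_log2fc[n] > 0) for n in shared)
    return agree / len(shared)
```

A concordance close to 1 supports the sequencing results, while small RNAs that disagree in direction are candidates for re-examination before they are reported as biomarkers.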
