
Differential Gene Expression

Differential Expression Analysis

Data primarily found in folder 4.differential-expression

Once the gene count matrix has been generated, it is normalized and analyzed for differential expression. The core uses three primary open-source R packages (DESeq2, edgeR, and NOISeq) for this process, because each has different strengths and weaknesses and approaches the statistical problems in a slightly different way. DESeq2 determines differentially expressed genes by fitting negative binomial generalized linear models to estimate expression change and using the Wald test to establish significance. edgeR is designed for low-replicate datasets (fewer than 7-8 replicates per condition); it models gene expression with a negative binomial distribution and assesses differential expression using an exact test adapted from Fisher’s exact test. NOISeq filters low-count features taking the experimental design into consideration and corrects for batch effect as part of normalization; it was originally designed for experiments with no replicates and is ideal for that use case.

DESeq2’s normalization algorithm is widely regarded as one of the strongest normalization methods for RNA sequencing data, because it accounts for both read depth (as the FPKM, CPM, and TPM metrics do) and library composition (the shift in gene expression ratios that occurs when a few genes take up a larger share of the reads). The resulting normalized gene count matrix is included in the folder 3.alignment-and-counts along with the CPM-normalized matrix created by edgeR. Note, however, that downstream edgeR analysis uses a normalization method similar to DESeq2’s rather than the CPM matrix. The remaining output files from these three programs are provided in the folder 4.differential-expression.
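To make the difference between depth-only and composition-aware normalization concrete, here is a minimal Python sketch of a median-of-ratios size factor calculation (the idea behind DESeq2's normalization) next to plain CPM. It is an illustration under simplifying assumptions, not the core's pipeline or the DESeq2 implementation: the function names and toy counts are hypothetical, and genes with a zero count in any sample are simply skipped.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts (one per sample).
    Genes with a zero in any sample are skipped (assumption of this sketch),
    because their geometric mean across samples would be zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        # Log of the gene's geometric mean across samples (the reference).
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Each sample's size factor is the median ratio to the reference.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / factors[j] for j, c in enumerate(row)]
            for g, row in counts.items()}


def cpm(counts):
    """Counts per million: corrects for read depth only."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {g: [c / totals[j] * 1e6 for j, c in enumerate(row)]
            for g, row in counts.items()}


# Toy data: sample 2 is sequenced twice as deeply, and geneD is strongly induced.
toy_counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
    "geneD": [10, 2000],
}
factors = size_factors(toy_counts)
by_ratio = normalize(toy_counts, factors)
by_cpm = cpm(toy_counts)
```

In this toy example, geneA through geneC come out equal across the two samples after median-of-ratios normalization, whereas CPM makes them appear lower in sample 2 because geneD absorbs most of that sample's reads.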

For each tool, a PCA or MDS plot is included showing clustering patterns across the top two axes of variability. A heatmap representing the top 50 most variable genes (according to each tool’s normalization algorithm) is also included, along with a heatmap of all differentially expressed genes.
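As a rough illustration of how a "most variable genes" list might be chosen for such a heatmap, the Python sketch below ranks genes by the variance of their log-transformed normalized counts. The function name, the log2(x + 1) transform, and the toy values are assumptions made for this example; each tool may rank variability differently.

```python
import math
from statistics import variance


def top_variable_genes(normalized, n=50):
    """Return the n genes with the highest variance across samples.

    normalized: dict mapping gene -> list of normalized counts (one per sample).
    A log2(x + 1) transform is applied first (assumption of this sketch) so
    that highly expressed genes do not dominate purely because of their scale.
    """
    scores = {
        gene: variance([math.log2(v + 1) for v in row])
        for gene, row in normalized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy example with three genes and four samples.
toy_normalized = {
    "flat": [10, 10, 10, 10],
    "mild": [10, 12, 9, 11],
    "strong": [1, 100, 2, 150],
}
top_two = top_variable_genes(toy_normalized, n=2)
```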

For the highest certainty of results, you can select only those genes considered differential by all three tools. However, the core’s standard process is to take only genes considered differentially expressed by at least two tools into clustering and functional enrichment analysis.
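The two selection rules described above (all three tools versus at least two) amount to a simple vote count per gene. The Python sketch below shows one way to express that; the function name and gene IDs are hypothetical, and the input is assumed to be one set of significant gene IDs per tool.

```python
from collections import Counter


def consensus_genes(calls, min_tools=2):
    """Return genes called differential by at least min_tools tools.

    calls: dict mapping tool name -> iterable of gene IDs called differential.
    """
    # set() ensures a gene is counted at most once per tool.
    votes = Counter(gene for genes in calls.values() for gene in set(genes))
    return {gene for gene, n in votes.items() if n >= min_tools}


# Toy example with hypothetical gene IDs.
toy_calls = {
    "DESeq2": {"g1", "g2", "g3"},
    "edgeR": {"g2", "g3", "g4"},
    "NOISeq": {"g3", "g5"},
}
at_least_two = consensus_genes(toy_calls, min_tools=2)  # standard rule
all_three = consensus_genes(toy_calls, min_tools=3)     # strictest rule
```

Raising min_tools trades sensitivity for certainty: the all-three set is always a subset of the at-least-two set.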

Within the clustering subfolder of 4.differential-expression, you can find the results of gene expression clustering analysis. DESeq2-normalized count values are extracted for genes determined to be differential by at least two of the R packages, and the genes are grouped by similar patterns of expression across all samples. A heatmap showing cluster assignment, a scatter plot of average expression of each cluster for each sample, and a table containing the list of genes (by ID) within each cluster are included.

Each group comparison has its own subfolder within 4.differential-expression. These folders contain the differential gene lists generated by all three DEG analysis methods, the merged differential gene lists containing information for each gene from each tool (one list each for a minimum log2 fold change of 0, 1, and 2), Venn diagrams showing the overlap between differentially expressed genes determined by each tool, an MA plot and volcano plot from DESeq2, an MA plot from edgeR, and an MD plot and expression plot from NOISeq.

Top left, MA plot: a graph of the average log2 fold change between sample groups for each gene against the average normalized count for each gene, showing the relationship between variability and expression level for each gene in a given group comparison.
Top right, expression plot: a graph of the average expression of each gene in group 2 against the average expression of each gene in group 1; typically looks fairly linear with significant outliers marked in red.
Bottom left, volcano plot: these plots show the general relationship between log2 fold change and adjusted p-value (graphed as -log10 of the value so that increased significance is higher on the y-axis). They typically look like an erupting volcano because the significance of a potential DEG call tends to increase as the calculated log2 fold change in expression increases.
Bottom right, MD plot: a graph of the additive difference in expression (D) between the two groups for each gene against the multiplicative difference (M, the log fold change) for that same comparison, providing a visual of how differential expression tracks across the expression level of each gene.
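The quantities plotted in the four panels above can be computed from two group means and an adjusted p-value. The Python sketch below is illustrative only: the function name, the pseudocount, and the example numbers are assumptions, and the tools themselves apply their own normalization and fold-change estimation before plotting.

```python
import math


def plot_coordinates(mean1, mean2, padj, pseudocount=1.0):
    """Per-gene values behind MA, MD, and volcano plots.

    mean1, mean2: average normalized expression in group 1 and group 2.
    padj: adjusted p-value (must be > 0, since its logarithm is taken).
    pseudocount: added before taking the ratio to avoid dividing by zero
    (assumption of this sketch).
    """
    log2_fold_change = math.log2((mean2 + pseudocount) / (mean1 + pseudocount))
    return {
        "M": log2_fold_change,                 # multiplicative difference
        "A": (mean1 + mean2) / 2,              # average expression (MA plot x-axis)
        "D": abs(mean2 - mean1),               # additive difference (MD plot)
        "neg_log10_padj": -math.log10(padj),   # volcano plot y-axis
    }


# Hypothetical gene: mean of 7 in group 1, 31 in group 2, adjusted p = 0.001.
example = plot_coordinates(7, 31, 0.001)
```

Plotting M against A gives an MA plot, D against M an MD plot, and neg_log10_padj against M a volcano plot; the expression plot uses mean1 and mean2 directly.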


Previous installment in differential gene expression deliverables: Sequence Processing and Alignment

Next installment in differential gene expression deliverables: Functional Enrichment Analysis