Menu Close

Tutorial: Differential Expression Analysis with edgeR

Originally published in 2010 and cited an astounding >27K times as of this month, the edgeR package (Robinson, McCarthy, and Smyth, 2010) is a reliable and commonly used tool for differential expression in RNA sequencing. Like STAR, it has robust documentation, providing not just an introduction to the software itself but a foundational understanding of the experimental and statistical factors to consider for RNA sequencing in general. The tutorial I’m providing here is just one way to approach this type of data and use the edgeR tool.

Installing Packages

To begin, you’ll need R v4.2, an IDE of your choice (I used RStudio), as well the R packages BiocManager, edgeR, and viridis (though this last one is optional). BiocManager is an installation tool that facilitates installation of biologically-applicable R packages like edgeR and viridis is a color palette tool that I prefer to the base R color palettes.

#install required packages
install.packages("BiocManager")

BiocManager::install("edgeR")

install.packages("viridis")

#load required packages
pkgs <- ("edgeR", "viridis")
lapply(pkgs, function (x) library(x, character.only = TRUE)

Reading In Data

Once the packages are installed, we can read in our gene count matrix from StringTie and establish our sample groupings and comparisons of interest. I like to do this using CSV template files for the sample groups and comparisons, so I don’t have to type everything into the R script itself; it doesn’t make much difference for a small project, but it reduces a lot of tedious data entry and human error risk on a large one. As you can see from the following examples, these are fairly straightforward!

comparisons.csv

Differential comparisons for edgeR analysis
NameGroup1Group2Group3
A_vs_B1-10
A_vs_C10-1
B_vs_C01-1
A_vs_Others1-0.5-0.5
B_vs_Others-0.51-0.5
C_vs_Others-0.5-0.51

factors.csv

sample assignment into the groups provided in comparisons.csv
NameGroup
A11
A21
A31
B12
B22
B32
C13
C23
C33

In the comparisons.csv file, for each comparison, the group (or groups) denoted with a negative value are the control or baseline against which change is expressed, and the group (or groups) denoted with a positive value are the treatment or experimental group. Here, we have both standard pairwise comparisons (1 vs -1) as well as weighted comparisons that let us combine smaller groups into one (1 vs -0.5+-0.5) to really focus what is unique about each particular group. It is important that you get the positive/negative attributions correct, or you may be thinking a gene shows increased expression in your treated samples when it is actually showing decreased expression…

Anyway, reading in the CSV files is quite easy with R’s read.csv command:

gcnts <-read.csv(genes.csv, header=TRUE, row.names = "gene_id")

factors <- read.csv(factors.csv, header=TRUE, row.names = "Name")

comparisons <- read.csv(comparisons.csv, header=TRUE)

As a precaution, to ensure that the sample order is consistent between the gene count matrix and the sample group assignments, the following reordering can be run:

gcnts <- gcnts[ , match(rownames(factors), colnames(gcnts))]

Then, we need to change the data type of our factors data frame into a factor with levels determined by our group assignment column:

group <- factor(factors$Group)

The last preparatory step is to put all the comparisons that we defined in the comparisons data frame into a list object, so we can take advantage of R’s lapply family of functions to accelerate the analysis.

complist <- lapply(split(comparisons, comparisons$Name),
                   function(x) {
                     as.list(x)
                     x<- x[c(-1)]
                     unlist(x, use.names=FALSE)
                     }
                   )

This should give you an object structured like this:

complist structure visual from RStudio

Creating A Model

I prefer to use the quasi-likelihood (QLF) test in edgeR to define the statistical model for the analysis, following the recommendation from the user guide that “the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It provides more robust and reliable error rate control when the number of replicates is small.” To do this, we first read the gene count matrix into an DGEList object for edgeR, and then use the calcNormFactors function to calculate the normalization factors used to even out experimental variability (for example, from different total read counts between samples). I’ve also included some code for creating an MDS plot with dynamic color attribution for different number of groups.

y<-DGEList(counts=gcnts)
y<-calcNormFactors(y)

num <- nlevels(group)
pal <- viridis_pal(option = "H", begin=0.1, end=0.9)(num)
colors <- lapply(factors,
                 function(x) {
                   x <- c(pal[x])
                 })

pdf("MDSplot.pdf")
plotMDS(y,cex=0.5,col=colors$Group)
dev.off()

The MDS plot should look similar to this one, ideally with the same well-defined separation between the sample group clusters. If you aren’t seeing clear differentiation between your sample groups at this stage, there may be other issues with the experimental design or  significant biological difference may not exist. (It’s always the worst having to let a researcher know that there isn’t any meaningful difference between their experimental groups…)

At this point I also like to write out the CPM (counts per million) table, since that is the only type of normalized count that wasn’t output by our StringTie scripts, and edgeR makes it quite easy.

d<-cpm(y)
write.table(d,"CPM.txt",quote=FALSE,sep="\t",row.names=TRUE)

Finally, we can generate the overall model for the experiment:

design<-model.matrix(~0+group)
y<-estimateGLMCommonDisp(y,design) #Estimate Common Negative Binomial Dispersion
y<-estimateGLMTrendedDisp(y,design) #Estimate Trended Dispersion for Negative Binomial GLMs
y<-estimateGLMTagwiseDisp(y,design) #Empirical Bayes Tagwise Dispersions for Negative Binomial GLMs

qfit<-glmQLFit(y,design)
gene.constant <- dim(y)[1]

Including tagwise dispersions for the model means that the dispersion for each gene is calculated individually in a Bayesian manner, using the more general dispersions as a foundation to improve upon, and is the dispersion type recommended by the edgeR user guide, especially for experiments with multiple factors (like this experiment!)

Differential Expression

Once the model is defined, we can look more closely at each of our comparisons of interest using the QLFTest function, which takes the fitted data output by the QLMFit function and runs pairwise comparisons as directed. Using lapply with our list of comparisons lets us test all of those comparisons in a single function, after which we can use the topTags function to extract the differential expression data table – ranked by p-value –  with a small amount of accompanying metadata. I also like to write out a text file for each comparison containing these data tables, without any filtering, so that information such as gene counts, p-values, and fold changes can be easily referenced for any gene of interest.

qlf.list <- lapply(complist,
                   glmQLFTest,
                   glmfit = qfit,
                   coef = NULL)

top.qlf.list <- lapply(qlf.list,
                       topTags,
                       n = gene.constant) 

#write DEG information for all genes
sapply(names(top.qlf.list),
       function (x) write.table(top.qlf.list[[x]], file = paste0(x, ".deg.all.edgeR.txt"), quote=FALSE))

The topTags data tables can also be used to create volcano plots for each comparison, providing an visualization of the average read count, fold change, and significance of each gene. I typically include reference lines at a log2 fold change of 1, equivalent to doubling or halving the baseline gene expression.

lapply(names(qlf.list),
       function(name) {
         pdf(file = paste0("volcano.edgeR.",name,".pdf"))
         plotMD(qlf.list[[name]], main=name)
         abline(h=c(-1,1), col="blue")
         dev.off()
       })

These plots are fairly simple, but can be edited to include gene labels if desired and enhanced in various ways.

example volcano plot for a pairwise differential expression comparison

Finally, we can extract the data for only the genes that pass our statistical significance cut-off from the topTags objects and write them out to text files. Because different researchers have different preferences, and for convenience in downstream analysis, I will usually create a list of all significant genes, a list of only the genes with increased expression in the treatment vs. the baseline, and a list of only the genes with decreased expression in the treatment vs. the baseline.

Because p-values can be less reliable with RNA-sequencing, I use the adjusted p-value (the FDR column) for the cutoff. The default adjustment used by edgeR is Benjamini Hochberg, but you can always double-check in the metadata of your topTags object.

deg.list <- lapply(top.qlf.list,
                   as.data.frame)

deg.sig.list <- lapply(deg.list,
                       function(x)x[x$FDR < 0.05,] )

sapply(names(deg.up.list), 
       function (x) write.table(deg.up.list[[x]], file = paste0(x, ".deg.all.sig.edgeR.txt"), quote=FALSE, sep='\t')) 

deg.up.list <- lapply(deg.sig.list,
                      function(x) {
                        x <- x[x$logFC > 0,]
                        x <- x[order(-x$logFC),]
                        }
                      )

sapply(names(deg.up.list),
       function (x) write.table(deg.up.list[[x]], file = paste0(x, ".deg.up.sig.edgeR.txt"), quote=FALSE, sep='\t'))

deg.down.list <- lapply(deg.sig.list,
                        function(x) {
                          x <- x[x$logFC < 0,]
                          x <- x[order(-x$logFC),]
                          }
                        )

sapply(names(deg.down.list),
       function (x) write.table(deg.down.list[[x]], file = paste0(x, ".deg.down.sig.edgeR.txt"), quote=FALSE, sep='\t'))

I hope that is helpful to someone out there! Please leave a comment or contact us if you have any questions.

1 Comment

  1. Pingback:Tutorial: RNA-seq Heatmap with edgeR

Leave a Reply

Your email address will not be published. Required fields are marked *