A heatmap is one of the most effective ways to quickly visualize the differences between RNA sequencing samples on a broad scale – and is relatively simple to create in R, either on a genome-wide basis or highlighting the most variable genes from the experiment. While any normalized count matrix can be used as input, I’m going to use the CPM values from out previous edgeR analysis tutorial here.
There are several packages that can be used for heatmap generation in R, but I prefer pheatmap for its simplicity and customization options. Since it isn’t a Bioconductor-specific package, installing it and loading it into your instance of R is easy. Also, we will read in the CPM file you made in the edgeR tutorial (or a similar normalized count matrix you created in a different analysis or received from a genomics core).
install.packages("pheatmap")
install.packages("viridis")
library("pheatmap")
library("viridis")
cpm <- read.delim("CPM.txt", sep="\t", header=TRUE, row.names)
The most basic heatmap we can make uses all the genes in the normalized count matrix. In this figure, I am choosing not to show the rownames – which would be the gene names – because there are so many of them that the fontsize would have to be ludicrously small to avoid overlap with adjacent row labels. I am also choosing to scale by row to prevent the few highly-expressed housekeeping genes from overshadowing between-sample differences in genes with lower overall expression. pheatmap automatically clusters by both row and column, so samples and genes with similar patterns of expression will be grouped together.
pdf("allgenes.edgeR.heatmap.pdf")
pheatmap(cpm,
scale="row",
color=inferno(30),
fontsize_row = 4,
main=paste0("Normalized Counts for All Genes"),
show_rownames = FALSE)
dev.off()
However, depending on your hardware and number of samples, there is a chance you may receive the following error when using this script: “Error: vector memory exhausted (limit reached?)” Given that there are over 55K gene annotations in the mouse genome, this isn’t particularly surprising.
Instead, we can choose to focus on the top thousand variable genes – which is honestly where we’re going to see most of the biologically relevant differences in any case – and which should give us a much more manageable computation. Note that this is still far too many genes to make annotation worthwhile!
#apply the var function to our cpm matrix to order genes by variability
var_genes <- apply(cpm, 1, var)
#select the names of the top 1000 most variable genes
var_1000 <- names(sort(var_genes, decreasing=TRUE))[1:1000]
#select the subset of the cpm matrix matching those top 1000 genes
cpm_1000 <- cpm[var_1000, ]
#finally, rerun the heatmap function
pdf("genes.edgeR.heatmap.pdf")
pheatmap(cpm_1000,
scale="row",
color=inferno(30),
fontsize_row = 4,
main="Normalized Counts for Top 1000 Genes",
show_rownames = FALSE)
dev.off()
For this particular experiment, the heatmap shows us that there is a marked difference between our A-series samples and both the B and C-series samples, while there is a much smaller difference between gene expression in the B-series and the C-series. This reassures us that there is real biological difference between the sample groups in our experiment and gives us a good idea of where we’ll see the most differential expression in the more detailed analyses.
If we do want a smaller heatmap with gene annotations, it isn’t hard to alter the code slightly for that purpose:
var_genes <- apply(cpm, 1, var)
var_100 <- names(sort(var_genes, decreasing=TRUE))[1:100]
cpm_100 <- cpm[var_100, ]
pdf("topgenes.edgeR.heatmap.pdf")
pheatmap(cpm_100,
scale="row",
color=inferno(30),
fontsize_row = 4,
main=paste0("Normalized Counts for Top 100 Genes"))
dev.off()
This is about as far as I would be comfortable pushing the fontsize, and it is quite hard to read without magnification. Also, it can be hard to get an idea about which genes are variably expressed unless you have all the Ensembl ids memorized! Let’s create one more heatmap with fewer genes, increased fontsize, and common gene names in place of the Ensembl ids. You can also expand the height of the output file to allow for larger annotations.
#install and load packages
install.packages("BiocManager")
BiocManager::install("biomaRt")
library("biomaRt")
#import reference genome information from Ensembl
mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
#create list of top variable genes as before
var_genes <- apply(cpm, 1, var)
var_50 <- names(sort(var_genes, decreasing=TRUE))[1:50]
cpm_50 <- cpm[var_50, ]
#find the gene name associated with each Ensembl id
var_ids <- getBM(filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "external_gene_name"),
values = var_50,
mart = mart)$external_gene_name
#change the matrix rownames from the IDs to the gene names
rownames(cpm_50) <- var_ids
#create the heatmap as a pdf with height double the default width
pdf("top50genes.edgeR.heatmap.pdf", height=10)
pheatmap(cpm_50,
scale="row",
color=inferno(30),
fontsize_row = 8,
main=paste0("Normalized Counts for Top 50 Variable Genes")
)
dev.off()
What is particularly useful about this heatmap and the hundred gene heatmap is that they show a large set of genes that is clearly more highly expressed in the A-series samples, a smaller set that is more highly expressed in the B-series samples, and a final even smaller set that is more highly expressed in the C-series samples – suggesting clear differentiation between all three of the primary experimental factors.
I hope this is helpful! Please comment here or contact us if you have any questions about creating a heatmap with your own data.