Individual StringTie Quants
After aligning your RNA-seq samples to a reference genome with STAR (see our two-part STAR tutorial here and here), the alignment data is output into a single sorted BAM file for each sample. For the differential expression statistical work downstream, however, we need numerical data: specifically, the number of reads that aligned to each gene and transcript in the reference genome. To extract this information from the alignment files, I typically use the tool StringTie to create a gene count matrix from the sorted alignments and the reference annotations. For each sample, the code I use looks like this:
stringtie \ -e -B -p 8 -G reference.gtf \ -A sample.stringtie.abund.txt \ -o sample.stringtie.gtf \ sample_STARAligned.sortedByCoord.out.bam
The -e option tells StringTie to operate in “expression estimation” mode, limiting its scope to identifying reads overlapping with known annotations from the reference, and omitting counts of any unannotated regions. This speeds up the process considerably and provides all the information needed for a standard differential expression analysis, but means that StringTie will not provide any assemblies of potential novel isoforms. The -B option is similar, and allows the creation of Ballgown compatible coverage tables for the reference transcripts. Both of these options require the -G option with the reference annotations in either GTF or GFF3 format.
The -p flag specifies the number of threads to use, and defaults to 1 if no value is provided. The value you’ll want to use will depend on your computing system.
The -A and -o flags are where we can set the names of our output files – first the tabular gene count file, and then the GTF file containing all detected annotations. If you are looking for novel isoforms and omit the -e option above, this GTF file will contain the assembly information for those novel isoforms as well as anything from the reference GTF that was identified in your sample. Otherwise, it may not be needed for downstream analysis.
Finally, we let StringTie know which BAM file we’d like it to use as input! This BAM file does need to be sorted by genomic location, so if you didn’t sort by coordinate as part of your STAR alignment, you’ll need to use another program like samtools to sort your BAM file before running StringTie.
Summarizing Project Data
At this point, each sample will have its own directory containing the StringTie output. We have successfully extracted numerical data and generated count matrices, but now we want to merge all of these individual tables into a single gene count matrix to make it simpler to normalize, group, and compare our samples. The team that created StringTie provides a useful script for this, called prepDE (available from the StringTie GitHub page here), that we can use from the parent directory in which all our StringTie output folders live:
python2 prepDE.py -i . or python prepDE.py3 -i .
Generating TPM and FPKM Values
This gene count matrix is all you really need to proceed with expression analysis, but it can also be useful to have tpm and fpkm values for your samples. While there are a lot of ways to do this, I like to use a perl script from the Griffith lab GitHub. This is an older script that hasn’t been updated in over five years and recently I’ve had some problems with it that may be caused by an update to StringTie affecting the output formatting. On my GitHub page you can find a very slightly edited version of the script that may work for you if the original does not.
dirs=(list of stringtie directory names) perl stringtie_expression_matrix.pl \ --expression_metric=TPM \ --result_dirs=$dirs \ --transcript_matrix_file=transcript_tpm_all_samples.tsv \ --gene_matrix_file=gene_tpm_all_samples.tsv perl stringtie_expression_matrix.pl \ --expression_metric=FPKM \ --result_dirs=$dirs \ --transcript_matrix_file=transcript_fpkm_all_samples.tsv \ --gene_matrix_file=gene_fpkm_all_samples.tsv
In discussions with researchers who aren’t bioinformaticians themselves, I’ve noticed some uncertainty about the meaning and use of TPM and FPKM values. FPKM stands for fragments per kilobase of exon per million fragments mapped, and is calculated for each feature (gene or transcript) by dividing the fragment count for that feature by the length of the feature and by the total number of fragments mapped anywhere on the genome. Mathematically, that is,
Where qi is the number of fragments mapped to feature i, li is the length of feature i, and ∑jqj is the number of all the fragments mapped to features within the genome. The FPKM value is designed to compare the expression of different genes within a single sample, and is not an ideal metric for comparing samples to each other. You may also see the term RPKM (reads per kilobase of exon per million reads mapped) used; this is essentially the same thing as FPKM, just for single-end instead of paired-end sequencing data.
TPM (transcripts per million) is closely related to FPKM, and can in fact be calculated using FPKM values: the TPM value for a given feature is equal to the FPKM value for that feature divided by the sum of all FPKM values for features in that reference genome:
While TPM was originally intended to facilitate cross-sample comparisons, recent testing has shown it to be unreliable for this purpose. For a great discussion of both of these values and their usefulness in differential expression analysis, I’d recommend reading the open-access article “TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository” by Zhao, Y., et al (2021), in the Journal of Translational Medicine. Both of the normalization count methods the author recommends (DESeq2 and TMM) will be used in our differential expression analysis for cross-sample comparisons, so we’ll discuss them more in later installments.