In this multi-part tutorial we’ll walk step by step through a standard RNA alignment using the open-source tool STAR (Spliced Transcripts Alignment to a Reference). Alignment is the first step of data processing for transcriptomic analyses: it lines up the sequencing reads against the reference genome so that we can count how many reads correspond to each gene or transcript in that genome. STAR has an amazingly detailed manual and is a very robust tool with a lot of capabilities, so this is going to be a basic introduction that hopefully gives a beginner the confidence to try it out before diving into the potential customizations available.
One of the unique challenges of RNA alignment, especially in eukaryotes, is that the entire length of a single sequencing read may not be able to completely align to a single region on the reference genome due to introns in the gene. Depending on the lengths of the exons involved, a single sequencing read may align to 2, 3, or more unique regions of the genome, leaving large gaps in between. STAR accounts for this by searching for short alignments, extending them out, and saving the best quality extension – then continuing this process with the section of the read following that extension. (“Best quality” can mean different things, and involves length, number of mismatches, and number of gaps. STAR allows you to customize these parameters.) STAR also considers both reads together when processing paired-end data, improving the overall accuracy of the alignment calls.
Before investing a lot of your time into this, make sure that you have a system that can handle the requirements of this tool! I typically run STAR with at least 32 GB of RAM, and I have to move my data to a server that allows FIFO files in order to run it. If you’re here at ASU, the free temporary storage on the /scratch drive works well for this.
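If you want a quick way to check a machine before committing to it, here is a minimal shell sketch. It assumes a Linux system (it reads /proc/meminfo), and the function names are made up for illustration. The 32 GB threshold is just the rule of thumb mentioned above, not a hard STAR requirement.

```shell
# Sketch of pre-flight checks before running STAR (Linux only).
# Function names are hypothetical helpers, not part of STAR.

# Print total system memory in kB as reported by the kernel.
# On a cluster this is the node total, not your job's allocation.
total_ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Succeed if a memory size in kB is at least the given number of GB (GiB).
# Usage: has_enough_ram <total_kb> <min_gb>
has_enough_ram() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Succeed if a FIFO (named pipe) can be created in the given directory.
# Usage: supports_fifo <directory>
supports_fifo() {
    fifo_probe="$1/.fifo_probe_$$"
    if mkfifo "$fifo_probe" 2>/dev/null; then
        rm -f "$fifo_probe"
        return 0
    fi
    return 1
}
```

For example, running `has_enough_ram "$(total_ram_kb)" 32 && supports_fifo . && echo "looks OK"` from the directory where you plan to do the alignment prints "looks OK" only if both checks pass.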
If you haven’t used STAR before, you’ll need to install it following the instructions on the STAR GitHub link above. If you’re at ASU and using the Agave computing cluster, you can activate the installation that already exists by using the following command, either in your interactive session or in your sbatch script:
module load star/2.7.9a
Once you have STAR installed and functional, the next step is to create a STAR-readable index for your reference genome. The following code is designed to be run from a directory containing both a fasta file and a GTF file for the reference (typically accessible from a database such as NCBI). When retrieving the reference data, make sure that both files are using the same annotation system! If you accidentally try to pair a RefSeq GTF file with a GenBank fasta file, you are going to have errors because the chromosomes follow a different naming convention.
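One way to catch a naming mismatch before indexing is to compare the sequence names in the two files. The sketch below is only an illustration: the function name is hypothetical, and it assumes an uncompressed FASTA and a tab-delimited GTF whose first column is the sequence name.

```shell
# Print sequence names that appear in the GTF but not in the FASTA headers.
# Empty output means every annotated sequence has a match in the FASTA.
# Usage: gtf_names_missing_from_fasta <reference.fa> <reference.gtf>
gtf_names_missing_from_fasta() {
    names_tmp=$(mktemp -d) || return 1
    # FASTA names: header lines, minus the ">" and anything after the first space
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*//' | sort -u > "$names_tmp/fasta"
    # GTF names: first column of every non-comment, non-blank line
    awk -F'\t' '!/^#/ && NF { print $1 }' "$2" | sort -u > "$names_tmp/gtf"
    # comm -13 keeps only the lines unique to the second (GTF) list
    comm -13 "$names_tmp/fasta" "$names_tmp/gtf"
    rm -rf "$names_tmp"
}
```

If `gtf_names_missing_from_fasta reference.fa reference.gtf` prints anything, those are names the GTF uses that the FASTA does not, and it is worth re-downloading a matching pair before going further.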
Additionally, I typically set sjdbOverhang to 100, since the STAR manual says this value works well in most cases and it lets me reuse the same STAR index for projects with varying read lengths. You can optimize it for a specific project by setting it to 1 less than the read length (i.e., 74 for a single-end 1x75bp run, or 149 for a paired-end 2x150bp run).
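If you would rather derive a project-specific value from the data, the STAR manual gives the ideal sjdbOverhang as the maximum read length minus 1. Here is a small sketch of that calculation; the function name is hypothetical, and it assumes standard four-line FASTQ records on standard input.

```shell
# Print (longest read length - 1) for FASTQ records read from stdin.
# Assumes each record is exactly four lines, with the sequence on line 2.
ideal_sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { if (max > 0) print max - 1 }'
}
```

Scanning the first 100,000 reads is usually plenty, e.g. `zcat sample_R1.fastq.gz | head -n 400000 | ideal_sjdb_overhang`, where `sample_R1.fastq.gz` stands in for one of your own files.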
STAR \
    --runThreadN 16 \
    --runMode genomeGenerate \
    --genomeDir ./ \
    --genomeFastaFiles reference.fa \
    --sjdbGTFfile reference.gtf \
    --sjdbOverhang 100
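Once the indexing job finishes, it is worth confirming that the index files were actually written before moving on. The sketch below checks for a handful of files that STAR 2.7.x writes into the genome directory. Treat the file list as an assumption rather than a complete inventory; the function name is hypothetical.

```shell
# Succeed only if the core STAR index files exist and are non-empty.
# The file list is a subset of what genomeGenerate produces (assumed, not exhaustive).
# Usage: index_looks_complete <genomeDir>
index_looks_complete() {
    index_status=0
    for index_file in Genome SA SAindex genomeParameters.txt chrName.txt; do
        if [ ! -s "$1/$index_file" ]; then
            echo "missing or empty: $1/$index_file" >&2
            index_status=1
        fi
    done
    return $index_status
}
```

With the command above, where the index is written to the current directory, that would be `index_looks_complete ./`. If anything is reported missing, the Log.out file in the same directory is the first place to look.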
Small genomes require some different settings, and all the details to consider can be found in the STAR manual. I have found it can take some trial and error to hit on a functional combination of options, so in case it is helpful, here is the version of the indexing code that I used for an alignment focusing only on the short non-coding RNA sequences from a bacterial genome:
STAR \
    --runThreadN 4 \
    --runMode genomeGenerate \
    --genomeDir ./ \
    --genomeFastaFiles GCF_000210855.2_ASM21085v2_genomic.fna \
    --sjdbGTFfile GCF_000210855.2_ASM21085v2_ncRNA.gtf \
    --sjdbOverhang 100 \
    --sjdbGTFfeatureExon ncRNA \
    --genomeSAindexNbases 4 \
    --limitGenomeGenerateRAM 32000000000
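For --genomeSAindexNbases in particular, the STAR manual suggests min(14, log2(GenomeLength)/2 - 1) as a typical value for small genomes, which makes a reasonable starting point for the trial and error. The sketch below computes it from a FASTA file. The function names are hypothetical, and rounding to the nearest integer is an assumption based on the manual's own examples (9 for a 1 Mb genome, 7 for a 100 kb genome). The value that ends up working for a given index may differ, as the 4 used above shows.

```shell
# Print the total number of bases in an uncompressed FASTA file.
# Usage: genome_length <reference.fa>
genome_length() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Print min(14, log2(length)/2 - 1), rounded to the nearest integer.
# Rounding is an assumption inferred from the examples in the STAR manual.
# Usage: suggest_sa_index_nbases <genome_length_in_bases>
suggest_sa_index_nbases() {
    awk -v n="$1" 'BEGIN {
        v = log(n) / log(2) / 2 - 1   # awk log() is the natural log
        r = int(v + 0.5)
        if (r > 14) r = 14
        print r
    }'
}
```

For example, `suggest_sa_index_nbases "$(genome_length reference.fa)"` prints a single integer you can try for --genomeSAindexNbases.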
Alright! Hopefully you’ve been able to work through any quirks with your computing system or reference genome – next time we can begin the alignment itself, using the index you’ve made here.