Menu Close

Tutorial: RNA Alignment with STAR, part 2

From part 1 of this tutorial you should have created STAR indexes for your reference genome, such that you have a reference directory containing the fasta sequence of the genome, the annotations in GTF format, and the STAR indexes.

The central command for aligning your short-read fastq sequences to this genome is as follows:

    --genomeDir /path/to/references/ \
    --readFilesCommand gunzip -c \
    --readFilesIn L1_S1_L001_R1_001.fastq.gz L1_S1_L001_R2_001.fastq.gz \
    --outFileNamePrefix L1_STAR \
    --runThreadN 6 \
    --readFilesCommand  gunzip -c \
    --outSAMtype  BAM SortedByCoordinate \
    --outSAMstrandField intronMotif \
    --outFilterIntronMotifs RemoveNoncanonical

There are so many other options to include with this command. I typically leave them at the default setting for a first run, and then alter them if the first alignment doesn’t look great. If you are going to edit a lot of the default settings, however, it can be easier to create a text file with the STAR parameters and pass that to the script. Below is a sample text file followed by the code for passing it to the aligner.

## params.txt ##
runThreadN                   6 
readFilesCommand             gunzip -c 
outSAMtype                   BAM SortedByCoordinate 
outSAMstrandField            intronMotif 
outFilterIntronMotifs        RemoveNoncanonical 
outFilterMatchNminOverLread  0.66 
outFilterScoreMinOverLread   0.66 
outFilterMatchNmin           0 
peOverlapNbasesMin           0

    --genomeDir /path/to/references \ 
    --readFilesCommand gunzip -c \ 
    --readFilesIn L1_S1_L001_R1_001.fastq.gz L1_S1_L001_R2_001.fastq.gz \ 
    --outFileNamePrefix L1_STAR \
    --parametersFiles params.txt

This code is specific for paired-end sequences, but can be adapted for single-read sequences simply by giving only one fastq file to the aligner, instead of the two paired files.

This is the part of the script that takes the longest – up to four hours for a sample with high read depth on a system with lower memory – and requires the most memory (STAR cites 16GB as the minimum amount of memory required, but for mammalian genomes they recommend at least 32GB, as do I).

The aligner outputs a BAM file sorted by genome coordinates (it can also output SAM files, or leave the BAM unsorted), a log file of the alignment process, and a final alignment summary file. This summary file provides a lot of useful information regarding the quality of the alignment, so let’s take a look here:

The first few lines tell us when the job started (12:20am), when the actual read mapping started (12:21am), when the job finished (12:40am), and the average mapping speed over the course of the alignment (99.95 million reads per hour). Fortunately for me, I had these jobs running on a scheduler, so I wasn’t up past midnight aligning RNA samples! Next, the log file gives us some high-level information about our input: the number of input reads (32,040,605) and the average input read length (302 – a bit long for standard differential expression analysis, but at the time we had a bulk sequencing discount for 2x150bp reads).

After that, we start getting some of the statistics about the alignment. In the section titled UNIQUE READS, the first two lines telling us the uniquely mapped reads number and percentage are the first two I check for a quick idea of the alignment quality. If the total number of uniquely mapped reads is less than 25M or the percentage of uniquely mapped reads is less than 70%, it’s typically a sign that I need to rerun the alignment with different parameters or potentially go back to the wet lab and sequence more (or even rebuild the library if the percentage is low enough). This particular library just passed those filters, so I then check the mismatch, deletion, and insertion rates per base. If the mismatch rate is higher than 2.5%, or the deletion or insertion rates are higher than 0.5%, I’ll triage the sample by running a stringent quality filter on the data and then realigning (and potentially resequencing if the quality filter reduces the total read number by too much, or if the error rates are still unacceptably high). This particular sample looks excellent, however.

Finally, I check the percentage of reads that are mapped to multiple loci and the percentage of reads that are too short. Often, when I have a sample with a low percentage of uniquely mapped reads, it will have a correspondingly higher percentage of reads that are too short. This data can be rescued by changing some of the alignment parameters – for example, instead of the default setting that 66% of the bases in a read have to be aligned, I can specify that only 50% of them need to be aligned, or give a minimum value like 30bp instead. This does also increase the number of multi-mapped reads along with the error rates, so it can be a bit of an optimization game to determine which settings allow you to capture the most information from your data without crossing over your quality thresholds.

For a quick look at some of these basic quality metrics over a whole project, I’ve written up a simple little script you can download on GitHub.

Bar graph showing mismatch, deletion, and insertion rates per base from a representative STAR alignment
Mismatch, deletion, and insertion rates per base
Horizontal bar graph showing the number of input reads falling into the different STAR alignment categories (uniquely mapped, multimapped, overly multimapped, unmapped due to mismatches, unmapped due to length, and unmapped for other reasons
Example read count metrics for a representative STAR alignment sample set

Let me know if you have any issues with or questions about this part of the process! Next time, we can move on to gene counts, normalizations, and differential expression analyses.

1 Comment

  1. Pingback:Tutorial: RNA Count Quantification with Stringtie

Leave a Reply

Your email address will not be published. Required fields are marked *