
Sequence Processing and Alignment

Sequencing Reads and Quality Metrics

Data found in folders 1.fastq and 2.sequencing.qc

The subfolder 1.fastq contains the demultiplexed FASTQ reads from the Illumina sequencer, with the forward and reverse reads in separate files. The subfolder 2.sequencing.qc contains the FastQC quality metrics for each FASTQ file, as well as a quality summary generated with MultiQC that includes an HTML visualization and tabular statistics.
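
To make the quality metrics more concrete, here is a minimal sketch of how a per-read mean quality score can be derived from a FASTQ file. It assumes four-line records and Phred+33 quality encoding (the convention for current Illumina output). The function names and the two example reads are hypothetical, and FastQC itself reports many more metrics than this.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file (or a blank line) stops the parser
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in quality) / len(quality)

# Hypothetical two-read example: 'I' encodes Q40 and '!' encodes Q0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
for name, seq, qual in read_fastq(example):
    print(name, len(seq), mean_phred(qual))
```

For real data the same functions can be applied to an open file handle, for example one returned by `gzip.open(path, "rt")` for compressed reads.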

Alignment of Short Read Sequences

Data found in folder 3.alignment-and-counts

The first step of RNA-sequencing analysis is to align the short reads to a reference genome so that reads can be counted for each transcript or gene, regardless of its length. If a reference genome is available in NCBI, we can use it as the alignment reference; otherwise, we can use a personal genome assembly from the originating researcher. In either case, both a FASTA file and a GFF/GTF annotation file are required in order to include annotations for each gene and enable functional enrichment analysis downstream.
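
Because the FASTA and GFF/GTF files have to describe the same assembly, a quick consistency check before alignment can save time. The sketch below is an illustration only, not part of the pipeline described here: it compares the sequence names in the FASTA headers against the first column of the annotation file. The function names and the example records are hypothetical.

```python
def fasta_sequence_names(lines):
    """Collect sequence names (the first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }

def gtf_sequence_names(lines):
    """Collect the seqname column (column 1) from GTF/GFF lines, skipping comments."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names

# Hypothetical inputs: the annotation mentions chrM, which the FASTA lacks.
fasta = [">chr1 example", "ACGTACGT", ">chr2", "GGCC"]
gtf = [
    "# hypothetical annotation",
    'chr1\tsrc\tgene\t1\t8\t.\t+\t.\tgene_id "g1";',
    'chrM\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";',
]
missing = gtf_sequence_names(gtf) - fasta_sequence_names(fasta)
print(sorted(missing))  # sequences annotated but absent from the FASTA
```

An empty result means every annotated sequence has a matching FASTA entry; anything else usually points to a naming mismatch such as `chr1` versus `1`.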

The alignment itself is performed with STAR, Alex Dobin's open-source aligner. STAR searches the genome for the maximum mappable portion of each read, then repeats the search on each remaining unmapped portion, which enables it to detect splice junctions. Once the maximum mappable seeds are found, they are stitched together into a full-length alignment of each read that can include mismatches, indels, and splice junctions.
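
The seed search can be illustrated with a toy example. The sketch below is not STAR's implementation (STAR searches an uncompressed suffix array and scores mismatches and gaps); it uses plain substring search with exact matching only, to show how repeatedly taking the longest mappable prefix splits a spliced read into seeds. The function names and the tiny "genome" are hypothetical.

```python
def maximal_mappable_prefix(read, genome):
    """Return (length, position) of the longest prefix of `read` found in `genome`."""
    best_len, best_pos = 0, -1
    for length in range(1, len(read) + 1):
        pos = genome.find(read[:length])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = length, pos
    return best_len, best_pos

def seed_read(read, genome):
    """Split a read into consecutive exact-match seeds (0-based genome_pos, seed)."""
    seeds = []
    start = 0
    while start < len(read):
        length, pos = maximal_mappable_prefix(read[start:], genome)
        if length == 0:
            start += 1  # skip a base that maps nowhere
            continue
        seeds.append((pos, read[start:start + length]))
        start += length
    return seeds

# Toy genome: two 10-base exons separated by an 8-base intron.
exon1, intron, exon2 = "GATTACAGGC", "TTTTTTTT", "CCATGGAACT"
genome = exon1 + intron + exon2
read = exon1[-6:] + exon2[:6]   # a read spanning the exon-exon junction

print(seed_read(read, genome))  # [(4, 'ACAGGC'), (18, 'CCATGG')]
```

The first seed ends at genome position 10 and the second begins at position 18, so the 8-base gap between them corresponds to the intron, which is the signal a splice-aware aligner uses to call a junction.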

The STAR analysis outputs a BAM alignment file for each sample. BAM files are binary, machine-readable files that contain all alignment information for each query-subject pair identified by the aligner (in this case, STAR). They are primarily used as intermediate files to feed into the next step of the pipeline, but can be converted to human-readable SAM files for manual inspection if desired.
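
For readers who do inspect the SAM text (commonly produced with `samtools view -h`), the sketch below shows how one alignment line can be read. It parses the first six mandatory SAM fields and uses the `N` operations in the CIGAR string to recover the intron coordinates of a spliced alignment. The function names and the example record are hypothetical, and the record is invented rather than taken from real output.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_sam_line(line):
    """Split one SAM alignment line and return its first six mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }

def splice_junctions(pos, cigar):
    """Return 1-based (intron_start, intron_end) pairs implied by 'N' operations."""
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":        # operations that consume reference bases
            ref += length
    return junctions

# Hypothetical record: 6 matched bases, an 8-base intron, 6 matched bases.
record = parse_sam_line(
    "read1\t0\tchr1\t5\t255\t6M8N6M\t*\t0\t0\tACAGGCCCATGG\tIIIIIIIIIIII"
)
print(splice_junctions(record["pos"], record["cigar"]))  # [(11, 18)]
```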

Once the alignments are generated, we use the open-source tool StringTie, together with several publicly available and in-house helper scripts, to count all transcripts and genes based on the reference genome annotations and to generate a unified gene count matrix. The accompanying transcript count matrix can be used to determine alternative splice junctions and variants for individual genes, but that analysis is not included in our standard RNA-sequencing deliverables.
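
The idea of a unified count matrix can be sketched as follows. This is only an illustration of the data structure, not the helper scripts mentioned above: it assumes per-sample gene counts are already available as dictionaries, and the function name, sample names, and counts are all hypothetical.

```python
def build_count_matrix(per_sample_counts):
    """Combine {sample: {gene: count}} into (genes, samples, rows).

    Each row holds one gene's counts across samples, in the order of
    `samples`; a gene missing from a sample is given a count of 0.
    """
    samples = sorted(per_sample_counts)
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    rows = [[per_sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows

# Hypothetical per-sample counts.
counts = {
    "control_1": {"geneA": 120, "geneB": 30},
    "treated_1": {"geneA": 95, "geneC": 12},
}
genes, samples, rows = build_count_matrix(counts)

print("gene_id," + ",".join(samples))
for gene, row in zip(genes, rows):
    print(gene + "," + ",".join(str(v) for v in row))
```

Filling absent genes with zero keeps every sample column the same length, which downstream differential expression tools generally expect.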

The output from StringTie is provided in the folder 3.alignment-and-counts and includes the transcript and gene count matrices, as well as FPKM- and TMM-normalized copies of those matrices.
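
As a reminder of what FPKM normalization does, the sketch below applies the standard definition (fragments per kilobase of transcript per million mapped fragments) to a pair of hypothetical genes. It is a worked example of the formula only. The exact normalization code used to produce the delivered matrices is not described here, and TMM, which additionally estimates per-sample scaling factors, is not shown.

```python
def fpkm(counts, lengths):
    """FPKM for one sample: count * 1e9 / (feature length in bases * total fragments).

    Here the total is taken as the sum of the supplied counts, which is an
    assumption of this sketch.
    """
    total = sum(counts.values())
    return {gene: c * 1e9 / (lengths[gene] * total) for gene, c in counts.items()}

# Hypothetical sample: geneB has three times the count and three times the length.
counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 3000}

print(fpkm(counts, lengths))  # {'geneA': 250000.0, 'geneB': 250000.0}
```

Both genes end up with the same FPKM because the larger raw count of geneB is fully explained by its greater length, which is the bias this normalization is meant to remove.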

Previous installment in differential gene expression deliverables: Example Folder Structure

Next installment in differential gene expression deliverables: Differential Gene Expression