This module is used in all meta-omics workflows.
Quality Filter and Trim
The first step of any metagenomics pathway is removing sequences derived from adapters. Adapters are necessary for Illumina sequencing, but any adapter sequence left in the reads can contaminate the sample data. When present, adapter sequences typically appear at the 3′ end of reads. This happens when a sequence fragment is shorter than the sequencing read length, so the sequencer reads past the end of the fragment and into the adapter. We use Trimmomatic to remove Illumina adapter sequences from the 3′ end of each read, with stringent alignment requirements to ensure that biologically meaningful sequences similar to the adapters aren’t inadvertently trimmed.
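To make the idea of 3′ adapter read-through concrete, here is a minimal Python sketch. It is not Trimmomatic's algorithm and not part of the pipeline scripts: the function name `trim_3prime_adapter` and the `min_overlap` and `max_mismatches` parameters are hypothetical, chosen only for illustration. The sketch looks for the adapter, or a prefix of it running off the end of the read, and cuts the read where the match begins.

```python
def trim_3prime_adapter(read: str, adapter: str,
                        min_overlap: int = 8,
                        max_mismatches: int = 0) -> str:
    """Return `read` with a 3' adapter (or adapter prefix) removed.

    Illustrative only: names and defaults are assumptions, and real
    trimmers use more sophisticated matching than this linear scan.
    """
    # Try every start position that leaves at least `min_overlap` bases
    # of the read to compare against the start of the adapter.
    for start in range(len(read) - min_overlap + 1):
        window = read[start:start + len(adapter)]
        # zip() stops at the shorter string, so a partial adapter that
        # runs off the 3' end of the read is compared prefix-to-prefix.
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= max_mismatches:
            return read[:start]
    return read


adapter = "AGATCGGAAGAGC"
read_with_adapter = "ACGTACGTAC" + adapter[:10]
trimmed = trim_3prime_adapter(read_with_adapter, adapter)
clean = trim_3prime_adapter("ACGTACGTACGTACGTACGT", adapter)
```

Requiring a minimum overlap and allowing few or no mismatches is a simple stand-in for the stringent alignment requirements described above: a short chance match to the adapter is left alone.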
Along with trimming adapter sequences, we remove low-quality reads from each sample. All reads shorter than 100 bp following the adapter trim are discarded, as are any reads where the average Phred score drops below 15 across any 4 consecutive bases. Finally, we generate new quality reports with FastQC and MultiQC so that quality and adapter content can be compared with those of the unprocessed samples.
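The two filtering rules above can be sketched in a few lines of Python. This is an illustration of the rule as worded here, not a reproduction of how Trimmomatic implements its quality steps. The function names, the default thresholds as parameters, and the assumption of Phred+33 quality encoding are all ours.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(char) - offset for char in quality]


def passes_quality_filter(quality: str,
                          min_length: int = 100,
                          window: int = 4,
                          min_mean_quality: float = 15) -> bool:
    """Apply the length rule, then the sliding-window mean-quality rule."""
    scores = phred_scores(quality)
    if len(scores) < min_length:
        return False
    # Slide a fixed-size window along the read; fail on the first
    # window whose mean quality falls below the threshold.
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window < min_mean_quality:
            return False
    return True


good = passes_quality_filter("I" * 100)                        # Q40 throughout
too_short = passes_quality_filter("I" * 99)                    # below 100 bp
bad_window = passes_quality_filter("I" * 50 + "#" * 4 + "I" * 46)  # four Q2 bases
one_bad_base = passes_quality_filter("I" * 50 + "#" + "I" * 49)    # single Q2 base
```

A single low-quality base does not fail a read under this rule, because the other three bases in each window keep the mean above 15. Only a run of poor bases does.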
Remove Host Reads
We typically remove host reads only when analyzing a host-associated microbiome, such as a human gut microbiome, and not from environmental samples such as soil. Wastewater samples occupy an interesting middle ground, and we will typically remove host reads from them out of caution.
To remove host reads, we align all reads from each sample to the host reference genome (using Bowtie2 for DNA samples and STAR for RNA samples) and retain only those reads that do not successfully map to the reference.
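As a rough illustration of "retain only those reads that do not map", the Python sketch below scans SAM records and collects the names of reads whose FLAG field has the unmapped bit (0x4) set, as defined in the SAM specification. How the pipeline scripts actually extract non-host reads is not stated here; they may rely on aligner options or other tools instead. The function name `unmapped_read_names` and the example records are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: the read itself is unmapped


def unmapped_read_names(sam_lines) -> set:
    """Return names of reads that did not align to the host reference."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & UNMAPPED_FLAG:
            names.add(fields[0])
    return names


# Made-up records for illustration: one mapped read, one unmapped read.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read_host\t0\tchr1\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read_microbe\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCAATG\tIIIIIIII",
]
non_host = unmapped_read_names(example_sam)
```

For paired-end data, a stricter variant would keep a pair only when both mates are unmapped. That choice depends on how conservative the host removal should be.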
Example Folder Structure
The output from these scripts is returned in a folder structure similar to the one below. The returned SAM files contain alignment information for every read aligned against the host reference genome, while the fastq files contain only reads that passed the QC filtering and adapter trimming and did not map successfully to the host reference genome. The fastqc and multiqc files provide quality statistics for these cleaned reads.
-cleaned_reads
  |--alignments
    |--sampleID_SQP_hostaligned.sam
    |--...per sample
  |--alignment_summary
    |--sampleID_covstats.txt
    |--sampleID_rpkm.txt
    |--...per sample
  |--fastq
    |--sampleID-nohost_SQP_L001_R1_001.fastq.gz
    |--sampleID-nohost_SQP_L001_R2_001.fastq.gz
    |--...per sample
  |--qc
    |--multiqc_report.html
    |--fastqc
      |--sampleID-nohost_SQP_L001_R1_001.fastqc.html
      |--sampleID-nohost_SQP_L001_R1_001.fastqc.zip
      |--sampleID-nohost_SQP_L001_R2_001.fastqc.html
      |--sampleID-nohost_SQP_L001_R2_001.fastqc.zip
      |--...per sample
    |--multiqc_data
      |--multiqc_citations.txt
      |--multiqc_data.json
      |--multiqc_fastqc.txt
      |--multiqc_general_stats.txt
      |--multiqc_software_versions.txt
      |--multiqc_sources.txt
      |--multiqc.log
Next module for metagenomics: Assembly-Free Taxonomic Classification with KRAKEN
Next module for paired metagenomic/metatranscriptomics: Assembly-Free Taxonomic Classification with KRAKEN
Next module for viral-enriched metagenomics: Assembly-Free Taxonomic Classification with KRAKEN
Next module for metatranscriptomics: Assembly-Free Taxonomic and Functional Classification with HUMAnN
Next module for viral-enriched metatranscriptomics: Assembly-Free Taxonomic and Functional Classification with HUMAnN