This module is used in all meta-omics workflows
HUMAnN Basic Methods
We use HUMAnN3 to compare reads to known genes, pathways, and functional modules using either UniRef50 or UniRef90 annotations. UniRef90 consists of clustered UniRef100 sequences where the sequences within each cluster have 90% sequence identity and 80% overlap with the longest sequence in the cluster; UniRef50 consists of clustered UniRef90 representative sequences (the longest sequence in each UniRef90 cluster) where the sequences within each cluster have at least 50% sequence identity and 80% overlap with the longest sequence in the cluster. UniRef90 can provide more granularity, especially in populations that are well-characterized, such as the human gut; UniRef50 can be more useful for metagenomes from less-studied environments such as soil, capturing more potential functionality by casting a wider net. Pathway definitions for functional characterization come from MetaCyc.
Secondarily to functional classification, HUMAnN runs taxonomic classification in the background with MetaPhlAn. This method uses a custom database, ChocoPhlAn3, containing over 5M unique clade-specific marker genes identified from reference and metagenomic-assembled genomes (the inclusion of metagenomic-assembled genomes helps classify reads from species that don’t yet have reference sequences available, as is common in environmental samples).
Taxonomic classification from MetaPhlAn is combined with the UniRef results to determine the contributions that different taxa within the population make to each UniRef protein family. This allows for more detailed information about how the different species within the population contribute functionally. Files returned here include the raw gene family counts per sample, the normalized gene family counts per sample, and the stratified and unstratified aggregate count files for the entire dataset.
Looking at the correlation between taxonomy and function the other way around, we return pathway abundance information that include taxonomic composition of each functional pathway found in the sample. Pathway abundance without corresponding taxonomic information is also returned, and we use this data for differential functional comparisons between sample categories. Confidence scores for the pathway calls can be found in the pathway coverage files.
Functional Enrichment by Category
The normalized pathway abundance data is used to calculate a PCA graph and then modeled using MaAsLin2, an R package designed for multivariate association studies in environmental and meta’omic studies. While we choose model specification based on our understanding of the metadata for your experiment, we include the specs file we used so you can confirm that it matches your design or adjust it for your own modeling with MaAsLin2 or another tool. The RDS objects for R generated by the software are also included for reference.
For each metadata category, MaAsLin2 models the sample groups to identify pathways that are significantly differentially expressed in one or more groups compared to the others. All results and all significant results are included in separate tables, and significant results are visualized as boxplots. As an example of the type of results produced here, the table might show that the genomics potential for a particular pathway such as octane oxidation is significantly reduced in patients with a certain diagnosis when compared to the other diagnoses in the data set.
Paired Metagenomics/Metatranscriptomics Functional Modeling
For paired metagenomics and metatranscriptomics samples only, we use an expansion of MaAsLin2 to model gene expression onto genetic potential and population classification and abundance. Taking advantage of the DNA functional and taxonomic data from the paired metagenomics samples, MTXmodel allows for comparison based on differential function by taxa. Since this is a customizable script, the code for the model (in R) is included with the data. Metadata categories are chosen for fixed and random effects to match the DNA MaAsLin2 model for the paired DNA samples, but you may wish to alter the model and rerun it.
All results and all significant results are included in tabular format, and significant results are visualized as boxplots as well as in two heatmaps. MTXmodel considers results with p < 0.05 to be significant, while the cut-off for the custom heatmap is q-value < 0.05. In the heatmaps, a higher coefficient represents a larger difference between that metadata category and the rest of the population. The standard heatmap includes all significant results, while the ‘exclusive’ heatmap includes only significant results from groups containing more than one individual to remove potentially outlying results.
This analysis can provide information at the level of which pathways are being differentially expressed by specific bacterial species between sample groups. For example, functional analysis alone may show that multiple pathways are overexpressed in condition 1, and the MTX model can show that taxa 1 is contributing more than expected to pathway 1, while taxa 2 is contributing more than expected to pathway 2. This lets us see not just which pathways in the microbiome differ between conditions, but also which species are largely responsible for those functional differences.
Taxonomic Characterization
Data similar to the Kraken output is included here: taxa lists for each sample (normalized as relative abundance out of 100%) and a merged taxa table for the sample set. In addition, clustered heatmaps and merged taxa tables specifically for species, genus, and family composition across samples are included, with a table specifying which species are associated with each cluster. The optimal number of clusters is estimated using a sum of squares “elbow” plot. In the cluster table, the species are ordered as they appear in the clustered heatmap.
We calculate and visualize alpha diversity for each sample’s microbiome, using the Shannon, Simpson, and Inverse Simpson indices, and to statistically compare that diversity between sample groups using the Games-Howell pairwise test. Comparative diversity is shown in box-violin plots with significant differences annotated with the p-value, and individual sample alpha diversity values are included in tabular format. In addition, we calculate a PCA matrix for the sample set by taxa, which we plot separately for each metadata category so clustering can easily be compared to different breakdowns of sample groups.
Example Folder Structure
The output from these scripts is returned in a folder structure similar to this.
-humann-assembly-free-analysis
|--metaphlan-classifications
|--metaphlan-alignments
|--sampleID_SQP_L001_RC_001_metaphlan_bowtie2.txt
|--...per sample
|--taxonomic-classifications
|--sampleID_SQP_L001_RC_001_metaphlan_bugs_list.tsv
|--...per sample
|--merged_metaphlan_table.txt
|--clustered-heatmaps
|--species-level
|--metaphlanS.clusternum.ClusteredHeatmap.pdf
|--metaphlan_species_clusters.txt
|--merged_metaphlan_table_species.txt
|--sumSquares.pdf
|--genus-level
|--...as for species
|--family-level
|--...as for species
|--pca-plots
|--pcoa_plot_category.png
|--...per category
|--alpha-diversity
|--InverseSimpsonAlphaDiversity_category.pdf
|--ShannonAlphaDiversity_category.pdf
|--SimpsonAlphaDiversity_category.pdf
|...per category
|--alpha-diversity-with-metadata.txt
|--DNA_UniRef90 (for metagenomics, paired metagenomics, and viral-enriched metagenomics)
|--pathway-abundance
|--normalized_pathabundance_stratified.tsv
|--normalized_pathabundance_unstratified.tsv
|--normalized_pathabundance.tsv
|--samples-raw
|--sampleID_SQP_L001_RC_001_pathabundance.tsv
|--...per sample
|--samples-normalized
|--sampleID_SQP_L001_RC_001_pathabundance-copm.tsv
|--...per sample
|--pathway-coverage
|--sampleID_SQP_L001_RC_001_pathcoverage.tsv
|--...per sample
|--gene-families
|--normalized_genefamilies_stratified.tsv
|--normalized_genefamilies_unstratified.tsv
|--samples-raw
|--sampleID_SQP_L001_RC_001_genefamilies.tsv
|--...per sample
|--samples-normalized
|--sampleID_SQP_L001_RC_001_genefamilies-cpm.tsv
|--...per sample
|--differential-functions
|--maaslin-model-specs.txt
|--PCA_Dimensions.pdf
|--PCA_Plot.pdf
|--maaslin2_90_model
|--category.pdf (boxplots for all significant results for category)
|--...per sample
|--all_results.tsv
|--significant_results.tsv
|--heatmap.pdf
|--maaslin2.log
|--features
|--filtered_data.tsv
|--filtered_data_norm.tsv
|--filtered_data_norm_transformed.tsv
|--figures
|--category_1.png
|--...per significant result per category
|--heatmap.png
|--fits
|--fitted.rds
|--ranef.rds
|--residuals.rds
|--DNA_UniRef50 (for metagenomics, paired metagenomics, and viral-enriched metagenomics)
|--...as for DNA_UniRef90
|--RNA_UniRef90 (for metatranscriptomics and viral-enriched metatranscriptomics)
|--...as for DNA_UniRef90
|--RNA_UniRef50 (for metatranscriptomics and viral-enriched metatranscriptomics)
|--...as for DNA_UniRef90
|--RNA_UniRef90 (for paired metatranscriptomics)
|--pathway-abundance
|--...as for DNA_UniRef90
|--pathway-coverage
|--...as for DNA_UniRef90
|--gene-families
|--...as for DNA_UniRef90
|--differential-functions
|--maaslin-significant-heatmap.pdf
|--maaslin-significant-exclusive-heatmap.pdf
|--MTX.paired-maaslin.R
|--MTXmodel-output
|--category.pdf
|--...per category
|--all_results.tsv
|--significant_results.tsv
|--mtx_model.log
|--sig-qval-forR
|--figures
|--category_1.png
|--...per significant result per category
|--heatmap.png
|--fits
|--fitted.rds
|--ranef.rds
|--residuals.rds
|--RNA_UniRef50 (for paired metatranscriptomics)
|--...as for RNA_UniRef90 (for paired metatranscriptomics)
Next module for metagenomics: Metagenome Assembly and Annotation
Next module for paired metagenomic/metatranscriptomics: Metagenome Assembly and Annotation
Next module for viral-enriched metagenomics: Metagenome Assembly and Annotation
Next module for metatranscriptomics: Metagenome Assembly and Annotation
Next module for viral-enriched metatranscriptomics: Metagenome Assembly and Annotation