Metagenome Assembly and Annotation

Most of this module is used for all workflows, but paired metatranscriptomics samples reuse the assembly from their paired metagenomics samples instead of a de novo co-assembly.

Complete Contig-Level Metagenome Assembly

To run analyses that are not based solely on individual reads, we build a co-assembly that serves as a “reference” of sorts for the microbiome as a whole. We assemble contigs from all reads in the dataset together rather than assembling each sample individually. This strategy allows for more complete MAGs (metagenome-assembled genomes) and more thorough alignment and abundance profiling, especially for species that are underrepresented in one sample but more prevalent in another, because sequence overlaps across samples allow longer contigs to be assembled. MEGAHIT is used for the assembly itself, and QUAST and gfastats are used to generate quality and metrics summaries for the assembly. Bowtie2 is used to create an index for downstream alignment of DNA reads, and HISAT2 is used to create an index for downstream alignment of RNA reads.
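The co-assembly and index-building steps described above can be sketched as command lines assembled in Python. This is a minimal sketch under stated assumptions, not the pipeline's actual invocation: the sample names, file names, output paths, and the helper `build_coassembly_commands` are hypothetical, and only the basic options of MEGAHIT (`-1`, `-2`, `-o`), `bowtie2-build`, and `hisat2-build` are used.

```python
from pathlib import Path


def build_coassembly_commands(samples, out_dir="coassembly"):
    """Return command lines for one co-assembly plus a DNA and an RNA index.

    `samples` maps a sample name to its (forward, reverse) read files.
    All names and paths are hypothetical placeholders.
    """
    # MEGAHIT accepts comma-separated lists of paired-end files, so reads
    # from every sample go into a single assembly run.
    forward = ",".join(r1 for r1, _ in samples.values())
    reverse = ",".join(r2 for _, r2 in samples.values())

    out = Path(out_dir)
    # MEGAHIT writes its contigs to final.contigs.fa in the output folder.
    contigs = (out / "final.contigs.fa").as_posix()

    return [
        ["megahit", "-1", forward, "-2", reverse, "-o", out.as_posix()],
        # Index for aligning DNA reads back to the co-assembly.
        ["bowtie2-build", contigs, (out / "coassembly_dna").as_posix()],
        # Index for aligning RNA reads back to the co-assembly.
        ["hisat2-build", contigs, (out / "coassembly_rna").as_posix()],
    ]


samples = {
    "sampleA": ("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"),
    "sampleB": ("sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"),
}
commands = build_coassembly_commands(samples)
for cmd in commands:
    print(" ".join(cmd))
```

The function only builds the commands; each list could be passed to `subprocess.run(cmd, check=True)` to execute it, assuming the tools are installed and on the `PATH`.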

Alignment to Co-Assembly

To determine the abundance of each contig within each sample, we align the reads from each sample back onto the co-assembly. Ideally, the overall mapping rate for each sample should be over 95%, which indicates that the co-assembly has captured the diversity of each individual population. For DNA reads we use Bowtie2 for this alignment, and for RNA reads we use HISAT2, in each case with the corresponding index built for the co-assembly above.
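The 95% guideline above can be checked from the alignment summaries, since both Bowtie2 and HISAT2 report a line ending in "overall alignment rate". The following is a minimal sketch: the helper names and the per-sample summary lines are made up for illustration and are not real results.

```python
import re

# Matches the final summary line printed by Bowtie2 and HISAT2,
# e.g. "96.70% overall alignment rate".
RATE_PATTERN = re.compile(r"([\d.]+)% overall alignment rate")


def overall_alignment_rate(summary_text):
    """Extract the overall alignment rate (in percent) from a summary."""
    match = RATE_PATTERN.search(summary_text)
    if match is None:
        raise ValueError("no overall alignment rate found in summary")
    return float(match.group(1))


def flag_low_mapping(rates, threshold=95.0):
    """Return the sorted names of samples whose rate is not above the threshold."""
    return sorted(name for name, rate in rates.items() if not rate > threshold)


# Illustrative summaries only; real summaries contain more lines.
example_summaries = {
    "sampleA": "96.70% overall alignment rate",
    "sampleB": "88.25% overall alignment rate",
}
rates = {name: overall_alignment_rate(text)
         for name, text in example_summaries.items()}
low = flag_low_mapping(rates)
print(low)
```

A sample that lands in the flagged list may contain organisms that the co-assembly did not capture well.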

Co-Assembly Annotation and Functional Profiling

We use the tool CAT/BAT to predict proteins in the co-assembly. These predictions are then used to assign a taxonomic classification to each contig and, by integrating the alignment information from the previous step, to estimate the abundance of each taxon within the population of each sample. The output we return includes an amino acid FASTA file containing the predicted protein sequences, a GFF file mapping the predicted proteins onto the contigs, a summary file showing the number of hits and top lineage score for each contig, and a classification file detailing the taxon ID assigned to each contig.
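To illustrate how contig classifications and alignment information can be combined into per-taxon abundances, here is a minimal sketch. It is not CAT/BAT's own calculation: it assumes, for illustration only, that abundance is approximated by the share of a sample's aligned reads landing on contigs assigned to each taxon, with no correction for contig length. The contig names, taxon labels, read counts, and the helper `taxon_abundance` are all hypothetical.

```python
from collections import defaultdict


def taxon_abundance(contig_taxon, contig_read_counts):
    """Sum aligned reads per taxon and convert to fractions of the total.

    contig_taxon:       contig name -> assigned taxon (hypothetical labels)
    contig_read_counts: contig name -> reads aligned in one sample
    Contigs without a classification are grouped as "unclassified".
    """
    totals = defaultdict(int)
    for contig, reads in contig_read_counts.items():
        taxon = contig_taxon.get(contig, "unclassified")
        totals[taxon] += reads

    total_reads = sum(totals.values())
    if total_reads == 0:
        return {}
    return {taxon: reads / total_reads for taxon, reads in totals.items()}


# Illustrative inputs for a single sample.
contig_taxon = {
    "contig_1": "taxon_A",
    "contig_2": "taxon_A",
    "contig_3": "taxon_B",
}
contig_read_counts = {
    "contig_1": 60,
    "contig_2": 20,
    "contig_3": 15,
    "contig_4": 5,  # no classification available
}
abundance = taxon_abundance(contig_taxon, contig_read_counts)
print(abundance)
```

Running the same function once per sample gives a per-sample abundance profile that can be compared across the dataset.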

We then use MicrobeAnnotator to identify functional annotations for the predicted proteins in the assembly, and determine the abundance of each of those proteins in the individual microbiomes. A text version of the annotations for each contig, including both the CAT/BAT GFF data and the functional predictions from MicrobeAnnotator, is provided (and is used later for annotating the gene and transcript count matrices for the metagenomics and metatranscriptomics samples).

Example Folder Structure

The output from this portion of the analysis is returned in a folder structure similar to this. SAM and BAM files for the DNA and/or RNA alignment to the co-assembly are included in case you’d like to run additional analyses in the future.

|--DNA-alignments (for metagenomics, paired metagenomics/metatranscriptomics, and viral-enriched metagenomics workflows)
    |--...per sample
    |--...per sample
    |--...per sample
|--RNA-alignments (for metatranscriptomics, paired metagenomics/metatranscriptomics, and viral-enriched metatranscriptomics workflows)
    |--...per sample

Next module for metagenomics: Functional Genomics Potential Profiling
Next module for paired metagenomics/metatranscriptomics: Functional Genomics Potential Profiling
Next module for viral-enriched metagenomics: Functional Genomics Potential Profiling
Next module for metatranscriptomics: RNA Expression Profiling
Next module for viral-enriched metatranscriptomics: RNA Expression Profiling