This module is used for metagenomics, paired metagenomics/metatranscriptomics, and viral-enriched metagenomics workflows.
Contig Clustering
We use METABAT to bin contigs from our co-assembly into potential metagenome-assembled genomes (MAGs), using the alignment data from each sample as a guide. METABAT is an unsupervised algorithm that clusters contigs based on the frequency of tetra-nucleotide sequences and the average base coverage of a contig within a given sample. Because it isn’t based on reference genomes, it has the potential to capture environmental genomes that have not previously been characterized (although it’s difficult to ascertain the quality of MAGs that don’t match any known reference sequence).
Binning is only performed for metagenomics data. In metatranscriptomics data, genes are expressed unevenly, so coverage varies widely across contigs from the same genome, which conflicts with METABAT’s use of average base coverage to group contigs.
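To make the two clustering signals concrete, here is a minimal Python sketch of how a tetra-nucleotide frequency profile and an average base coverage could be computed for a single contig. It is an illustration only, not METABAT's implementation: the function names are hypothetical, all 256 tetramers are counted separately (reverse complements are not merged), and windows containing ambiguous bases such as N are ignored.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each A/C/G/T 4-mer in a contig."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Count every overlapping window of length 4.
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Only windows made entirely of A/C/G/T contribute to the total.
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return {k: 0.0 for k in kmers}
    return {k: counts[k] / total for k in kmers}


def mean_coverage(per_base_depth):
    """Average read depth over all positions of a contig in one sample."""
    if not per_base_depth:
        return 0.0
    return sum(per_base_depth) / len(per_base_depth)
```

Contigs from the same genome are expected to have similar frequency profiles and similar coverage in each sample, which is what makes these two features useful for clustering.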
Bin Annotations
For each bin, we use CAT/BAT to identify marker genes and classify the taxonomy of each bin. The output we return includes a text file containing the top hit for each ORF within each bin, amino acid FASTA and GFF files for the concatenated set of predicted proteins, and classification tables with either numeric taxa IDs or human-readable taxa names.
Additionally, we predict taxonomy for each bin using CheckM, which tends to classify bins much more conservatively than CAT/BAT. CheckM also evaluates the quality of each bin by estimating how complete the binned genome is and how much of the bin may be contamination from other species. We return an Excel file containing the classifications and quality scores for each bin.
We include a joint summary of the classification and quality of each bin in the output, to simplify bin evaluation. A well-defined MAG should be classified to the genus or species level by at least one tool, have high CheckM completeness, have low CheckM contamination and strain heterogeneity, and have BAT lineage scores of at least 0.5.
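The criteria above can be sketched as a simple filter. This is a hypothetical example: the field names do not correspond to actual columns in the summary file, and the completeness, contamination, and strain heterogeneity cutoffs are placeholder assumptions that you should set for your own study. Only the 0.5 lineage score cutoff comes from the text above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinSummary:
    # All field names are illustrative, not the real column names.
    bat_rank: str                 # deepest rank assigned by BAT, e.g. "genus"
    checkm_rank: str              # deepest rank assigned by CheckM
    completeness: float           # CheckM completeness, in percent
    contamination: float          # CheckM contamination, in percent
    strain_heterogeneity: float   # CheckM strain heterogeneity, in percent
    bat_lineage_scores: Sequence[float]


def is_well_defined(
    b,
    min_completeness=90.0,     # assumed cutoff for "high" completeness
    max_contamination=5.0,     # assumed cutoff for "low" contamination
    max_heterogeneity=10.0,    # assumed cutoff for "low" heterogeneity
    min_lineage_score=0.5,     # stated in the text above
):
    """Return True if a bin meets every criterion for a well-defined MAG."""
    deep_ranks = {"genus", "species"}
    classified = b.bat_rank in deep_ranks or b.checkm_rank in deep_ranks
    scores_ok = bool(b.bat_lineage_scores) and all(
        s >= min_lineage_score for s in b.bat_lineage_scores
    )
    return (
        classified
        and b.completeness >= min_completeness
        and b.contamination <= max_contamination
        and b.strain_heterogeneity <= max_heterogeneity
        and scores_ok
    )
```

Keeping the cutoffs as keyword arguments makes it easy to loosen or tighten them when screening bins from different environments.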
Example Folder Structure
The output files from the binning analysis are provided in a folder laid out similarly to this:
-binning-MAGs
|--bin-classification-summary.csv
|--checkM.bin.classifications.xlsx
|--out.BAT.bin2classification.official.txt
|--out.BAT.bin2classification.txt
|--out.BAT.concatenated.predicted_proteins.faa
|--out.BAT.concatenated.predicted_proteins.gff
|--metabat-bins
|  |--bin.1.fa
|  |--... (one file per bin)
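As a quick check of the returned bins, the following Python sketch counts the contigs and total bases in each bin FASTA file. It assumes the files follow the bin.N.fa naming shown above and sit in a folder such as binning-MAGs/metabat-bins; the function name is hypothetical, and the path should be adjusted to match your delivery.

```python
from pathlib import Path


def bin_sizes(bins_dir):
    """Map each bin FASTA file name to (contig count, total bases)."""
    sizes = {}
    for fasta in sorted(Path(bins_dir).glob("bin.*.fa")):
        n_contigs = 0
        n_bases = 0
        with open(fasta) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    # Each header line starts a new contig.
                    n_contigs += 1
                else:
                    # Sequence lines may be wrapped, so sum their lengths.
                    n_bases += len(line)
        sizes[fasta.name] = (n_contigs, n_bases)
    return sizes
```

For example, bin_sizes("binning-MAGs/metabat-bins") would return one entry per bin, which can be compared against the completeness estimates in the summary table.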
Next module for paired metagenomic/metatranscriptomics: RNA Expression Profiling