1. How DRAM Works
DRAM takes as input the path to a fasta file or a path with wildcards that matches multiple fasta files. Each file is processed independently, and results are merged after annotation is complete. Each file is first filtered to remove short contigs (by default, contigs < 5000 bp). Prodigal is then used to detect open reading frames and predict gene amino acid sequences.
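The contig length filter can be illustrated with a minimal sketch. This is not DRAM's own code: the function names `read_fasta` and `filter_contigs` are hypothetical, and the sketch assumes that a contig exactly at the threshold is kept.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=5000):
    """Drop contigs shorter than min_length bases (assumed default: 5000)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]
```

Passing an open file handle to `read_fasta` and the resulting records to `filter_contigs` would give the contigs that go on to gene prediction.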
DRAM searches all amino acid sequences against a variety of databases to annotate gene sequences, and all annotations are considered together (Figure 1). Gene sequences are searched against KEGG, UniRef90 and MEROPS using MMseqs2. All hits above a minimum bit score threshold (default 60) are reported. Hits from these databases are then used for reciprocal best hit searches. A match is considered a reciprocal best hit if the gene is the top hit in the reverse search of the forward hit against all detected genes in the fasta and the bit score of the reverse search is above 350. All gene sequences are also compared to HMM profiles using HMMER. All profiles from PFAM, dbCAN and VOGDB are used, and all hits against all genes are reported. A hit is recorded if the coverage length is greater than 80% and the e-value is less than 10^-5, or if the coverage length is less than 80% and the e-value is less than 10^-3. Because not all users have access to a KEGG subscription, KOfam is used to assign KOs if KEGG genes are not provided. After open reading frame annotation, tRNAs are detected using tRNAscan-SE and rRNAs are detected using barrnap. Users can also provide tab-separated files with taxonomy and bin quality information, including files taken directly from GTDB-tk and checkM, which are used to add taxonomy and genome quality information, respectively, to the annotation table.
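The two hit-filtering rules above can be written out as a short sketch. The function names and argument layout are hypothetical and do not come from DRAM itself. The text does not say how a coverage of exactly 80% is handled, so the sketch assumes it falls in the lower-coverage branch.

```python
def hmm_hit_passes(coverage, evalue):
    """Apply the coverage-dependent e-value cutoff for HMM hits.

    coverage is a fraction between 0 and 1. A coverage of exactly 0.8 is
    treated as the lower-coverage case, which is an assumption.
    """
    if coverage > 0.8:
        return evalue < 1e-5
    return evalue < 1e-3


def is_reciprocal_best_hit(gene_id, reverse_hits, min_reverse_bit_score=350):
    """Check the reverse search of a forward hit against all detected genes.

    reverse_hits is a list of (gene_id, bit_score) pairs from that reverse
    search. The match is reciprocal if gene_id is the top-scoring reverse
    hit and its bit score is above the threshold.
    """
    if not reverse_hits:
        return False
    top_gene, top_score = max(reverse_hits, key=lambda hit: hit[1])
    return top_gene == gene_id and top_score > min_reverse_bit_score
```

For example, `is_reciprocal_best_hit("g1", [("g1", 400.0), ("g2", 120.0)])` is true, while the same call with a top reverse score of 300 is not.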
After annotation is complete for all input files, the results are merged. Additionally, each gene annotation is given a score representing the confidence of the annotation. The primary output is the annotations table, where all annotations for each gene are reported. All scaffolds from all bins are provided in a fasta file, along with a GFF3 file containing all annotation information. Fasta files of all genes as nucleotides and amino acids are also given, as is a folder with one genbank file for each input fasta.
DRAM annotations are distilled using a script that takes the annotations.tsv as well as the tRNAs and rRNAs detected in the annotation step. The genome statistics are built from the annotations file, which supplies taxonomy, completeness and contamination; from the tRNA file, which supplies tRNA counts; and from the rRNA file, which supplies the locations of detected rRNA genes. Together these provide all information required for MIMAG. The genome summary file is built from a table containing all summarized metabolisms and the annotations file. The summarized metabolism form is collated from multiple sources. It includes KEGG modules, CAZy genes separated by substrate, MEROPS genes, tRNAs and custom modules. In the output, identifiers are split across multiple sheets based on functional categories, with additional layers added. Gene counts are measured per genome by counting the number of genes annotated with the identifier associated with each function.
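The per-genome gene counting can be sketched as follows. The input layout (one `(genome, identifiers)` pair per gene) and the function name are assumptions made for illustration, not the format DRAM uses internally.

```python
from collections import Counter, defaultdict


def count_identifiers_per_genome(gene_annotations, identifiers):
    """Count, per genome, how many genes carry each identifier of interest.

    gene_annotations: iterable of (genome, identifiers on one gene).
    identifiers: the identifiers listed in the summarized metabolism form.
    Genomes with no matching gene are absent from the result.
    """
    wanted = set(identifiers)
    counts = defaultdict(Counter)
    for genome, gene_ids in gene_annotations:
        # A gene contributes at most one count to each identifier it carries.
        for identifier in wanted & set(gene_ids):
            counts[genome][identifier] += 1
    return {genome: dict(counter) for genome, counter in counts.items()}
```

A gene annotated with several identifiers is counted once under each of them.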
The annotations are further distilled to form the liquor. The table used to make the liquor was made via curation of specific metabolisms that are commonly of interest in a variety of microbial communities. The liquor has three primary parts: KEGG module coverage, electron transport chain component completion and specific function presence. The modules selected for KEGG module coverage were chosen because of their central role in metabolism. Pathway coverage is measured using the structure of KEGG modules. Modules are broken up into steps, and each step is divided into paths. Paths can be additionally subdivided into substeps with subpaths. Coverage is given as the percent of steps that are present; substeps and subpaths are not considered. A step is considered present if at least one gene is present from at least one path. This does not require all subunits of all proteins to be present. Electron transport chain component completion is measured similarly. Components are all KEGG modules, and completion is measured by the number of genes present in a path through the module. Modules are represented as directed networks where KOs are nodes and outgoing edges connect to the next KOs in the module. Completion is the largest percentage of genes present along any single path through the network. Function presence is measured based on the presence of genes with a set of identifiers. Some functions look for the presence of a single gene, while others only require one or more of a set of genes to be present. Other, more complex traits require multiple parts to be present, where at least one gene from each of multiple sets must be present. These parts are shown in the form of a heatmap created using Altair.
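The two coverage measures described above can be sketched in a few lines. The data layouts are assumptions chosen for illustration: a module is a list of steps, each step a list of paths, and each path a flat list of KOs; an electron transport chain component is a dictionary mapping each KO to the KOs that follow it. The sketch also assumes the network is acyclic.

```python
def module_step_coverage(steps, present_kos):
    """Fraction of steps with at least one gene present in at least one path."""
    present = set(present_kos)
    if not steps:
        return 0.0
    covered = sum(
        1 for paths in steps if any(present & set(path) for path in paths)
    )
    return covered / len(steps)


def best_path_completion(edges, present_kos):
    """Largest fraction of present KOs along any start-to-end path."""
    present = set(present_kos)
    targets = {node for nexts in edges.values() for node in nexts}
    starts = set(edges) - targets  # nodes with no incoming edge

    def walk(node, path):
        path = path + [node]
        successors = edges.get(node, [])
        if not successors:
            yield path  # reached the end of the module
        for nxt in successors:
            yield from walk(nxt, path)

    best = 0.0
    for start in starts:
        for path in walk(start, []):
            fraction = sum(ko in present for ko in path) / len(path)
            best = max(best, fraction)
    return best
```

Enumerating every path is fine for networks the size of a single module; a larger graph would call for dynamic programming instead.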
DRAM-v depends on the outputs of VIRSorter to annotate viral contigs and determine potential AMGs. It takes as input the predicted viral contigs from VIRSorter and the VIRSorter_affi-contigs.tab file. Predicted viral contigs are provided as a single fasta file, which is created by concatenating the predicted viral sequences from VIRSorter. This can be a subset of the contigs in the predicted viral contigs output. All contigs are first processed using the same pipeline as in DRAM, with the addition of a BLAST-type annotation against all viral proteins in NCBI RefSeq.
After annotation, auxiliary scores are assigned to each gene. The auxiliary scores are on a scale from 1 to 5 and represent the confidence that a gene is viral in origin, where a score of 1 represents a gene that is confidently viral and a score of 5 represents a gene that users should take caution in treating as viral. Auxiliary scores are assigned based on the categories of the flanking viral protein clusters from the VIRSorter_affi-contigs.tab file. A gene is given an auxiliary score of 1 if there is at least one hallmark gene (a VIRSorter protein cluster with category 0 or 3) on both the left and right flanks. An auxiliary score of 2 is assigned when the gene has a hallmark gene on one flank and a viral-like gene (a VIRSorter protein cluster with category 1 or 4) on the other flank. An auxiliary score of 3 is assigned to genes that have a viral-like gene on both flanks. An auxiliary score of 4 is given to genes with a viral-like or hallmark gene on one flank and no viral-like or hallmark gene on the other flank, and to all genes that are part of a stretch of three or more adjacent genes with non-viral metabolic function. An auxiliary score of 5 is given to genes on contigs with no viral-like or hallmark genes and to genes at the end of contigs.
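The flank-based rules can be read as a decision cascade, sketched below. This is an illustration rather than DRAM-v's implementation: the function name and arguments are hypothetical, the order in which overlapping rules are checked is an assumption (genes at a contig end are scored 5 first), and the rule about stretches of non-viral metabolic genes is left out.

```python
HALLMARK_CATEGORIES = {0, 3}    # VIRSorter hallmark protein cluster categories
VIRAL_LIKE_CATEGORIES = {1, 4}  # VIRSorter viral-like protein cluster categories


def auxiliary_score(left_categories, right_categories, at_contig_end=False):
    """Score a gene from the protein cluster categories found on each flank."""
    if at_contig_end:
        return 5  # assumed to take precedence over the flank rules
    left, right = set(left_categories), set(right_categories)
    hallmark_left = bool(left & HALLMARK_CATEGORIES)
    hallmark_right = bool(right & HALLMARK_CATEGORIES)
    viral_like_left = bool(left & VIRAL_LIKE_CATEGORIES)
    viral_like_right = bool(right & VIRAL_LIKE_CATEGORIES)

    if hallmark_left and hallmark_right:
        return 1
    if (hallmark_left and viral_like_right) or (viral_like_left and hallmark_right):
        return 2
    if viral_like_left and viral_like_right:
        return 3
    if hallmark_left or hallmark_right or viral_like_left or viral_like_right:
        return 4
    return 5  # no hallmark or viral-like gene on either flank
```

For instance, `auxiliary_score([0], [1])` gives 2, and `auxiliary_score([0], [])` gives 4.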
Various flags that may change the confidence in a gene being viral are also assigned. The viral flag (V) is assigned when the gene has been assigned a VOGDB identifier with the replication or structure category. The metabolism flag (M) is assigned if the gene is present in the summarized metabolism form. The known AMG flag (K) is assigned when the gene has been annotated with a database identifier representing a function from a previously identified AMG, and the experimentally verified flag (E) is assigned when the gene's database identifier is a previously identified AMG that has been experimentally verified to affect host metabolism. The attachment flag (A) is given when the gene has been given identifiers associated with viral host attachment and entry. The near contig end flag (F) is given when the gene is within 5000 bases of the end of a contig. The transposon flag (T) is given when the gene is on a contig that contains a transposon. The bacterial flag (B) is given when three genes in a row are given the metabolism flag and not the viral or viral attachment and entry flags.
The distillation of DRAM-v annotations is largely based around the detection of potential AMGs. By default, a gene is considered a potential AMG if its auxiliary score is less than 4, it has been assigned an M flag, and it has not been assigned an A, V or T flag. The flags and the auxiliary score threshold can be changed by the user. DRAM-v annotations are distilled to create a viral contig summary and a potential AMG summary. The viral contig summary is a table that lists each contig with information about it. Included are the VIRSorter category of the virus, whether the virus is circular, whether the virus is a prophage, the number of genes in the virus, the number of strand switches along the contig, whether a transposase is present on the contig and the number of potential AMGs. The potential AMG summary gives the metabolic information associated with each potential AMG, as found in the summarized metabolism form.
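The default potential AMG rule can be expressed as a small predicate. The function and parameter names are hypothetical and are not DRAM-v's command-line options; flags are assumed to be passed as a string of single-letter codes such as "MK".

```python
def is_potential_amg(auxiliary_score, flags, max_auxiliary_score=3,
                     required_flags="M", excluded_flags="AVT"):
    """Return True if a gene meets the default potential AMG criteria.

    With integer scores, "less than 4" is the same as "at most 3".
    """
    flag_set = set(flags)
    return (
        auxiliary_score <= max_auxiliary_score
        and set(required_flags) <= flag_set        # every required flag is present
        and not (set(excluded_flags) & flag_set)   # no excluded flag is present
    )
```

A gene with score 2 and flags "MK" qualifies, while a gene with score 1 and flags "MV" does not because of the V flag.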
The DRAM-v liquor further summarizes the potential AMGs, showing all viral contigs, the number of potential AMGs in each contig and a heatmap of the modules that each AMG is a part of.