Infer pathway abundances

Gavin Douglas edited this page Aug 1, 2019 · 21 revisions

Pathway abundances are calculated using the same approach as HUMAnN2, based on the abundances of gene families that can be linked to reactions within pathways (E.C. numbers regrouped to MetaCyc reactions by default). By default, pathways are first identified as present or absent with MinPath.

Either a structured or an unstructured pathway mapfile can be input (the default mapfile is structured). The mapfile is used to identify which pathways are likely present based on the presence of requisite gene families.

Input gene family abundances can be stratified or unstratified by contributing organisms; however, stratified pathway abundances will only be written if the input gene families are in stratified format. Note that stratified abundances refer to how much each predicted genome is contributing to the community pathway abundances (not the predicted level of that pathway within that organism alone!). To get pathway abundances broken down by contributing sequence you need to use the --per_sequence_contrib option (see below).
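The difference between the community-wide view and the per-genome view can be seen with a toy example. This is only a sketch: the genome names, reaction IDs, and abundances are made up, and the scoring rule (the minimum reaction abundance) is a placeholder chosen for simplicity, not the calculation the script performs. It also does not attempt to show how the community-wide value is split among contributing genomes in the default stratified output.

```python
# Toy data: reaction abundance contributed by each predicted genome in one sample.
# All names and values here are hypothetical.
genome_rxn = {
    "genome_1": {"RXN-A": 6.0, "RXN-B": 0.0},
    "genome_2": {"RXN-A": 0.0, "RXN-B": 6.0},
}
pathway_rxns = ["RXN-A", "RXN-B"]  # hypothetical two-step pathway


def toy_pathway_abundance(rxn_abundance):
    # Placeholder rule (an assumption, not the script's method):
    # the pathway is only as abundant as its rarest reaction.
    return min(rxn_abundance.get(rxn, 0.0) for rxn in pathway_rxns)


# Community-wide view: pool reactions across genomes, then score the pathway.
community = {
    rxn: sum(rxns.get(rxn, 0.0) for rxns in genome_rxn.values())
    for rxn in pathway_rxns
}
community_abundance = toy_pathway_abundance(community)

# Per-genome view: score the pathway within each genome on its own.
per_genome_abundance = {
    name: toy_pathway_abundance(rxns) for name, rxns in genome_rxn.items()
}

print(community_abundance)
print(per_genome_abundance)
```

In this toy case the pooled community contains both reactions, so the pathway scores above zero, while neither genome alone carries the complete pathway.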

There are two default mapfiles used by this script. These files are specified by default so you do not need to specify them yourself! However, it is useful to understand what this script does by default. First E.C. numbers are regrouped to MetaCyc RXNs using this mapfile: default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv. These MetaCyc RXNs can then be used to infer MetaCyc pathway abundances using this mapfile: default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt. This second mapfile contains maps of reactions to pathways for the subset of MetaCyc pathways found in prokaryotes.
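The regrouping step can be pictured as summing gene family abundances into the reactions they map to. The sketch below is a simplified illustration: the E.C. numbers, reaction IDs, and abundances are placeholders rather than entries from the real mapfile, and it assumes that each E.C. number contributes its full abundance to every reaction it maps to and that unmapped E.C. numbers are dropped.

```python
from collections import defaultdict

# Hypothetical E.C. number abundances for one sample (made-up values).
ec_abundance = {"EC:1.1.1.1": 10.0, "EC:2.7.1.1": 4.0, "EC:9.9.9.9": 7.0}

# Hypothetical regrouping map: one E.C. number may map to several reactions.
# These reaction IDs are placeholders, not taken from the default mapfile.
ec_to_rxn = {
    "EC:1.1.1.1": ["RXN-A", "RXN-B"],
    "EC:2.7.1.1": ["RXN-B"],
}


def regroup(abundances, mapping):
    """Sum gene family abundances into every reaction they map to."""
    regrouped = defaultdict(float)
    for gene_family, value in abundances.items():
        # Assumption: gene families missing from the map are dropped.
        for rxn in mapping.get(gene_family, []):
            regrouped[rxn] += value
    return dict(regrouped)


rxn_abundance = regroup(ec_abundance, ec_to_rxn)
print(rxn_abundance)
```

The resulting reaction table is what the second mapfile would then relate to pathways.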

Use this command to run MinPath on the predicted gene family output and get unstratified pathway abundances (for pathways found in prokaryotes):

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
                    -o pathways_out \
                    --intermediate minpath_working \
                    -p 1

The input arguments and options to this command are:

  • -i metagenome_out/pred_metagenome_strat.tsv.gz - Stratified or unstratified output of metagenome_pipeline.py
  • -m MINPATH_MAPFILE - path to mapfile of gene families to pathways of interest (default: default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
  • -o pathway_out - Output folder to write final pathway abundance tables.
  • --intermediate - Optional folder where intermediate files will be written (otherwise the intermediate files will not be kept).
  • --coverage - Calculate pathway coverages as well as abundances, which are an alternative way to identify which pathways are present. Note that these values are experimental and only useful for advanced users. Coverage is also calculated using the same approach as HUMAnN2.
  • --no_gap_fill - Turn off gap-filling (which boosts the lowest reaction abundance in a pathway by default).
  • --no_regroup - Turn off re-grouping to reactions: this is necessary if the gene families you are inputting can be directly related to pathways.
  • --skip_minpath - Do not run MinPath to identify which pathways are present as a first pass (MinPath is run by default).
  • --regroup_map - Mapfile to use for regrouping input gene family abundances to reactions (default: default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
  • -p INT - Number of processes to run in parallel.
  • --per_sequence_contrib - Option to specify that stratified abundances should be reported in terms of the contribution by each predicted genome rather than how much each genome is contributing to the overall community abundance. In other words, pathway abundances will be calculated for each individual predicted genome. Both --per_sequence_abun and --per_sequence_function need to be specified when this option is set. Stratified coverages will only be reported when this option is used (and --coverage is set). As of v2.2.0-b, unstratified pathway abundances based on the community-wide pathway abundances and also based on the per-sequence pathway abundances will be output when this option is used.
  • --per_sequence_abun - Path to sequence abundance table normalized by marker-gene abundances (file output by metagenome_pipeline.py step named "seqtab_norm.tsv.gz" by default).
  • --per_sequence_function - Path to predicted gene family abundance per sequence (main output file of hidden-state prediction step named "EC_predicted.tsv.gz" by default).
  • --wide_table - Flag to specify that wide-format stratified table should be output rather than metagenome contribution table. This is the deprecated method of generating stratified tables since it is extremely memory intensive (default: False).
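
Because --per_sequence_contrib requires two companion options, it can help to assemble the command in a script and check it before running. The following is a minimal sketch that uses only the options described above; the helper check_per_sequence_args is hypothetical, and the folder names and output folder are assumptions (only the file names "seqtab_norm.tsv.gz" and "EC_predicted.tsv.gz" come from the option descriptions).

```python
import shlex

# Input locations are assumptions: adjust them to wherever your files live.
cmd = [
    "pathway_pipeline.py",
    "-i", "EC_metagenome_out/pred_metagenome_strat.tsv.gz",
    "-o", "pathways_out_per_seq",
    "--per_sequence_contrib",
    "--per_sequence_abun", "EC_metagenome_out/seqtab_norm.tsv.gz",
    "--per_sequence_function", "EC_predicted.tsv.gz",
    "-p", "1",
]


def check_per_sequence_args(args):
    """Raise if --per_sequence_contrib is set without its companion options."""
    if "--per_sequence_contrib" in args:
        required = ("--per_sequence_abun", "--per_sequence_function")
        missing = [opt for opt in required if opt not in args]
        if missing:
            raise ValueError("missing: " + ", ".join(missing))
    return True


check_per_sequence_args(cmd)

# Print a shell-ready command line; the same list could instead be passed
# to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

The script only builds and prints the command, so it can be inspected before anything is executed.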