How to run the workflow

Useful snakemake options

Other useful options that you can specify to snakemake on the command line include:

-r or --reason: print the reason for execution of each rule
-por --printshellcmds: print shell commands that will be executed
--nolock: do not lock the working directory
--unlock: unlock working directory
--ri or --rerun-incomplete: re-run all jobs where output may be incomplete
--jobs [N] or -j [N]: Use at most N CPU cores/jobs in parallel. If -j is used with the SLURM profile to submit jobs in a compute cluster infrastructure, -j specifies the maximum number of jobs to submit to the queue.

For a full list of snakemake command line options, see here.

Targets

If you don't want to run the full workflow from start to finish in one go you may specify one or several 'targets' on the commandline. Some useful targets are:

Target	Description
`qc`	Runs preprocessing (as specified in the configuration file) and generates a `sample_report.html`
`assemble`	Assembles samples according to sample list and config file, and generates some assembly statistics
`quantify`	Quantifies open reading frames called on assembled contigs, producing RPKM-normalized and raw counts
`annotate`	Annotates open reading frames called on assembled contigs using settings defined in the config file. Also quantifies genes and features, producing normalized and raw counts
`taxonomy`	Assigns taxonomy to assembled contigs and open reading frames called on those contigs
`bin`	Performs genome binning of assembled contigs and generates some statistics of the binned genomes
`classify`	Read-based (e.g. Kraken/Centrifuge/MetaPhlAn) classification of preprocessed reads

To use these targets add them to the snakemake command line call. For instance, to run only the preprocessing part:

snakemake --use-conda --configfile config.yaml -j 4 qc

Targets may also be combined, so if you want to generate assemblies and run read-based classification you can do:

snakemake --use-conda --configfile config.yaml -j 4 assemble classify

Reports

After the workflow has completed you can generate a report with summarized statistics of the run. Depending on the run, the report will also include links to output files produced (e.g. tables, plots and html files). To produce a report, run:

snakemake --report report.html

IMPORTANT: When generating the report you must call snakemake the same way you did when you ran the workflow itself otherwise snakemake will report a WorkflowError: because the expected output is not present.

As an example, say you have a config file config.yaml specifying to run preprocessing and assembly of your samples and you run the workflow as such:

snakemake --use-conda -j 4 --configfile config.yaml

When the workflow is finished you can then generate a report by running:

snakemake --use-conda -j 4 --configfile config.yaml --report report.html

To see an example of what the report may look like click here to download a report from one of the test runs of the workflow.

Examples

Here are a few common examples. They are written in a structure showing the relevant configuration parameters, the command(s) to run and the expected output. All examples assume you have a configuration file called config.yaml with the appropriate parameters, but you may of course use any config file name you want. A suggestion is to make a copy of the default config file and make your changes in the copy.

Assembly-based analysis

Assemble reads with Megahit

Configuration

assembly:
  # run Megahit assembler?
  megahit: True
  # Use Metaspades instead of Megahit for assembly?
  metaspades: False

megahit:
  # maximum threads for megahit
  threads: 20
  # keep intermediate contigs from Megahit?
  keep_intermediate: False
  # extra settings passed to Megahit
  extra_settings: "--min-contig-len 300 --prune-level 3"

Command

snakemake --use-conda --configfile config.yaml -j 4 -p assemble

Output

results
|- assembly/               
|  |- <assembly1>/final_contigs.fa   the fasta file with assembled contigs
|  |- ...
|  |- <assemblyN>/final_contigs.fa   
|- report/
|  |- assembly/
|  |  |- assembly_stats.txt          table of assembly statistics         
|  |  |- assembly_size_dist.txt      file with sizes of assemblies contained at different contig lengths
|  |  |- assembly_stats.pdf          a plot of general assembly statistics
|  |  |- assembly_size_dist.pdf      a plot of the size distribution of the assembly
|  |  |- alignment_frequency.pdf     a plot of the overall alignment frequency after mapping reads to assembled contigs

NOTE: To use the Metaspades assembler, simply change your config file to:

assembly:
  metaspades: True

Protein-level annotations

Open reading frames called on assembled contigs can be annotated using eggnog-mapper, pfam_scan and rgi (Resistance Gene Identifier). If you are running the workflow on the Uppmax compute cluster you can use centrally installed databases for the first two of these, see more under the section Running the workflow on Uppmax.

Configuration

Using these settings in your config file runs all three tools to annotate protein sequences in your assemblies.

annotation:
  # run eggnog-mapper to infer KEGG orthologs, pathways and modules?
  eggnog: True
  # run PFAM-scan to infer protein families from PFAM?
  pfam: True
  # run Resistance gene identifier?
  rgi: True

Command

snakemake --use-conda --configfile config.yaml -j 4 -p annotate

Read-based analysis

Metaphlan

The workflow runs the recently released version 3 of Metaphlan. MetaPhlAn aligns reads to a set of core marker genes and estimates abundances of taxonomic clades in your samples.

Configuration

classification:
  metaphlan: True

Command

snakemake --use-conda --configfile config.yaml -j 4 -p classify

Output

results
|- metaphlan/               raw, per sample output from metaphlan 
|
|- report/
|- metaphlan/               
|  |- metaphlan.tsv         clade relative abundances per sample
|  |- metaphlan.pdf         clustermap of relative abundance summed to <metaphlan_plot_rank>
|  |- metaphlan.html        Krona interactive plot (Linux only)

Kraken2

There are pre-built kraken databases available at https://benlangmead.github.io/aws-indexes/k2. To make use of e.g. the Greengenes prebuilt database, copy its HTTPS url and run:

mkdir -p temp/kraken_db
mkdir -p resources/kraken/prebuilt/16S_Greengenes/
curl -L -o kraken.tgz <HTTPS-url>
tar -C temp/kraken_db -xf kraken.tgz
mv temp/kraken_db/*/* resources/kraken/prebuilt/16S_Greengenes
rm -r temp/16S_Greengenes.tgz temp/kraken_db

This installs the prebuilt database under `resources/kraken/prebuilt/16S_Greengenes. To configure the workflow to use this database make sure your config file has the following setup:

kraken:
  standard_db: False
  prebuilt: "16S_Greengenes"

Configuration

classification:
  kraken: True

Command

snakemake --use-conda --configfile config.yaml -j 4 -p classify

Output

results
|- kraken/                  raw, per sample output from kraken2
|
|- report/
|- kraken/               
|  |- kraken.krona.html     Krona interactive plot (Linux only)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to run the workflow

Contents

Useful snakemake options

Targets

Reports

Examples

Assembly-based analysis

Assemble reads with Megahit

Configuration

Command

Output

Protein-level annotations

Configuration

Command

Read-based analysis

Metaphlan

Configuration

Command

Output

Kraken2

Configuration

Command

Output

Clone this wiki locally