How to run the workflow
Once you've created the sample list and modified the config file, the most basic way to run the workflow is with the command:
snakemake --use-conda --configfile <myconfig.yaml> --cores 4
Here --cores specifies how many CPU cores to use in parallel.
Other useful options that you can specify to snakemake on the command line include:
- --reason or -r: print the reason for execution of each rule
- --printshellcmds or -p: print shell commands that will be executed
- --nolock: do not lock the working directory
- --unlock: unlock working directory
- --rerun-incomplete or --ri: re-run all jobs where output may be incomplete
- --jobs [N] or -j [N]: use at most N CPU cores/jobs in parallel. If -j is used with the SLURM profile to submit jobs in a compute cluster infrastructure, -j specifies the maximum number of jobs to submit to the queue.
For a full list of snakemake command line options, see the Snakemake documentation.
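Before committing cores to a long run, it can help to preview what snakemake intends to do. The sketch below wraps a dry run in a small shell function; the name preview_run is hypothetical and not part of the workflow, and --cores 1 is passed only as a placeholder since nothing is executed.

```shell
# Hypothetical helper (not part of the workflow): preview planned jobs
# without executing anything.
# Usage: preview_run <configfile> [targets...]
preview_run() {
    local config=$1
    shift
    # --dry-run lists the jobs that would run;
    # --printshellcmds also shows the shell command of each job.
    snakemake --use-conda --configfile "$config" --cores 1 \
        --dry-run --printshellcmds "$@"
}
```

For example, preview_run config.yaml qc would list the preprocessing jobs without starting them.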
If you don't want to run the full workflow from start to finish in one go, you may specify one or several 'targets' on the command line. Some useful targets are:
| Target | Description |
|--------|-------------|
| qc | Runs preprocessing (as specified in the configuration file) and generates a sample_report.html |
| assemble | Assembles samples according to sample list and config file, and generates some assembly statistics |
| quantify | Quantifies open reading frames called on assembled contigs, producing RPKM-normalized and raw counts |
| annotate | Annotates open reading frames called on assembled contigs using settings defined in the config file. Also quantifies genes and features, producing normalized and raw counts |
| taxonomy | Assigns taxonomy to assembled contigs and open reading frames called on those contigs |
| bin | Performs genome binning of assembled contigs and generates some statistics of the binned genomes |
| classify | Read-based (e.g. Kraken/Centrifuge/MetaPhlAn) classification of preprocessed reads |
To use these targets, add them to the snakemake command line call. For instance, to run only the preprocessing part:
snakemake --use-conda --configfile config.yaml -j 4 qc
Targets may also be combined, so if you want to generate assemblies and run read-based classification you can do:
snakemake --use-conda --configfile config.yaml -j 4 assemble classify
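A misspelled target only surfaces once snakemake has parsed the whole workflow, so a small guard can save time. The following is a sketch under assumptions: run_targets is a hypothetical helper, its accepted names are only the targets listed in the table above (extend the list if you use others), and config.yaml and -j 4 simply mirror the examples on this page.

```shell
# Hypothetical helper (not part of the workflow): check target names
# against the targets listed in the table before calling snakemake.
run_targets() {
    local target
    for target in "$@"; do
        case "$target" in
            qc|assemble|quantify|annotate|taxonomy|bin|classify) ;;
            *)
                echo "unknown target: $target" >&2
                return 1
                ;;
        esac
    done
    # All names look fine: request every target in a single snakemake call.
    snakemake --use-conda --configfile config.yaml -j 4 "$@"
}
```

Calling run_targets assemble classify is then equivalent to the combined command shown above.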
After the workflow has completed you can generate a report with summarized statistics of the run. Depending on the run, the report will also include links to output files produced (e.g. tables, plots and html files). To produce a report, run:
snakemake --report report.html
IMPORTANT: When generating the report you must call snakemake the same way you did when you ran the workflow itself, otherwise snakemake will report a WorkflowError because the expected output is not present.
As an example, say you have a config file config.yaml specifying to run preprocessing and assembly of your samples, and you run the workflow as such:
snakemake --use-conda -j 4 --configfile config.yaml
When the workflow is finished you can then generate a report by running:
snakemake --use-conda -j 4 --configfile config.yaml --report report.html
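One way to make sure the two calls cannot drift apart is to pass the same arguments to both from a single place. This is only a sketch; run_then_report is a hypothetical helper name, not something the workflow provides.

```shell
# Hypothetical helper (not part of the workflow): run the workflow, then
# build the report with exactly the same arguments.
# Usage: run_then_report <report.html> <snakemake arguments...>
run_then_report() {
    local report=$1
    shift
    # The report is only generated if the workflow run itself succeeded.
    snakemake "$@" && snakemake "$@" --report "$report"
}
```

The example above would then become run_then_report report.html --use-conda -j 4 --configfile config.yaml.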
To see an example of what the report may look like click here to download a report from one of the test runs of the workflow.
Here are a few common examples. Each one shows the relevant configuration parameters, the command(s) to run and the expected output. All examples assume you have a configuration file called config.yaml with the appropriate parameters, but you may of course use any config file name you want. A suggestion is to make a copy of the default config file and make your changes in the copy.
# run Megahit assembler?
megahit: True
# Use Metaspades instead of Megahit for assembly?
metaspades: False
# maximum threads for megahit
threads: 20
# keep intermediate contigs from Megahit?
keep_intermediate: False
# extra settings passed to Megahit
extra_settings: "--min-contig-len 300 --prune-level 3"
snakemake --use-conda --configfile config.yaml -j 4 -p assemble
|- assembly/
| |- <assembly1>/final_contigs.fa the fasta file with assembled contigs
| |- ...
| |- <assemblyN>/final_contigs.fa
|- report/
| |- assembly/
| | |- assembly_stats.txt table of assembly statistics
| | |- assembly_size_dist.txt file with sizes of assemblies contained at different contig lengths
| | |- assembly_stats.pdf a plot of general assembly statistics
| | |- assembly_size_dist.pdf a plot of the size distribution of the assembly
| | |- alignment_frequency.pdf a plot of the overall alignment frequency after mapping reads to assembled contigs
NOTE: To use the Metaspades assembler, simply change your config file to:
metaspades: True
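Once the assemble target has finished, a quick look at the contig files can confirm that something was actually assembled. A minimal sketch, assuming the final_contigs.fa files are ordinary FASTA files; count_contigs is a hypothetical helper and the path you pass depends on where your output directory lives.

```shell
# Hypothetical helper (not part of the workflow): count the sequences in
# a FASTA file by counting header lines, i.e. lines starting with '>'.
# Usage: count_contigs <path/to/final_contigs.fa>
count_contigs() {
    grep -c '^>' "$1"
}
```

The numbers can be compared with assembly_stats.txt in the report directory, which remains the fuller summary.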
Open reading frames called on assembled contigs can be annotated using eggnog-mapper, pfam_scan and rgi (Resistance Gene Identifier). If you are running the workflow on the Uppmax compute cluster you can use centrally installed databases for the first two of these; see the section Running the workflow on Uppmax.
Using these settings in your config file runs all three tools to annotate protein sequences in your assemblies.
# run eggnog-mapper to infer KEGG orthologs, pathways and modules?
eggnog: True
# run PFAM-scan to infer protein families from PFAM?
pfam: True
# run Resistance gene identifier?
rgi: True
snakemake --use-conda --configfile config.yaml -j 4 -p annotate
The workflow runs version 3 of MetaPhlAn, which aligns reads to a set of core marker genes and estimates abundances of taxonomic clades in your samples.
metaphlan: True
snakemake --use-conda --configfile config.yaml -j 4 -p classify
|- metaphlan/ raw, per sample output from metaphlan
|- report/
|- metaphlan/
| |- metaphlan.tsv clade relative abundances per sample
| |- metaphlan.pdf clustermap of relative abundance summed to <metaphlan_plot_rank>
| |- metaphlan.html Krona interactive plot (Linux only)
There are pre-built Kraken databases available at https://benlangmead.github.io/aws-indexes/k2. To use, for example, the prebuilt Greengenes database, copy its HTTPS URL and run:
mkdir -p temp/kraken_db
mkdir -p resources/kraken/prebuilt/16S_Greengenes/
curl -L -o kraken.tgz <HTTPS-url>
tar -C temp/kraken_db -xf kraken.tgz
mv temp/kraken_db/*/* resources/kraken/prebuilt/16S_Greengenes
rm -r kraken.tgz temp/kraken_db
This installs the prebuilt database under resources/kraken/prebuilt/16S_Greengenes. To configure the workflow to use this database, make sure your config file has the following setup:
standard_db: False
prebuilt: "16S_Greengenes"
kraken: True
snakemake --use-conda --configfile config.yaml -j 4 -p classify
|- kraken/ raw, per sample output from kraken2
|- report/
|- kraken/
| |- kraken.krona.html Krona interactive plot (Linux only)
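If the classify target fails right after installing a prebuilt database, it may be worth confirming that the download and move steps above left a complete database behind. The sketch below assumes the usual Kraken 2 layout of three files (hash.k2d, opts.k2d and taxo.k2d); check_kraken_db is a hypothetical helper, not part of the workflow.

```shell
# Hypothetical helper (not part of the workflow): verify that a directory
# holds the three files a Kraken 2 database normally consists of.
# Usage: check_kraken_db <database directory>
check_kraken_db() {
    local db_dir=$1
    local f
    for f in hash.k2d opts.k2d taxo.k2d; do
        # -s is true only if the file exists and is not empty.
        if [ ! -s "$db_dir/$f" ]; then
            echo "missing or empty: $db_dir/$f" >&2
            return 1
        fi
    done
    echo "ok: $db_dir"
}
```

For the example on this page the call would be check_kraken_db resources/kraken/prebuilt/16S_Greengenes.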