Best practices, wrappers, schemas, report, config files, and more
Per Unneberg
-# Contents
A best practice repo
Wrappers and scripts
Configuration and schemas
Coding practices and hints
-- Very simple examples with snakefiles and code to run
-- All snakefiles and code is available in code repository
-- code has been run with Snakemake version 6.8.1
-# [Snakemake best practices summary](
-: Snakemake (>=5.11) comes with a code quality checker (a so called
- linter). It is highly recommended to run the linter before
- publishing any workflow, asking questions on Stack Overflow or
- filing issues on Github.
-: There is an automatic formatter for Snakemake workflows, called
- Snakefmt, which should be applied to any Snakemake workflow before
- publishing it.
-: It is a good idea to add some minimal test data and configure Github
- Actions for continuously testing the workflow on each new commit.
-: Stick to a standardized structure.
-: Configuration of a workflow should be handled via config files and,
- if needed, tabular configuration like sample sheets (either via
- Pandas or PEPs). Use such configuration for metadata and experiement
- information, **not for runtime specific configuration like threads,
- resources and output folders**. For those, just rely on Snakemake’s
- CLI arguments like --set-threads, --set-resources,
- --set-default-resources, and --directory.
-: Try to keep filenames short, but informative.
-Rules and functions
-: Try to keep Python code like helper functions separate from rules.
-: Make use of Snakemake wrappers whenever possible
-# A best practice repo
-Clone the repo (`git clone`) and list
-```{bash snakemake-byoc-2021-bp-overview, cache=TRUE }
-tree -a -d -L 2 -I '.snakemake|.git' snakemake_best_practice
-## What does it do?
-Excerpts from
-```{r snakemake-byoc-2021-bp-readme, code=readLines("snakemake_best_practice/")[c(1:25)], eval=FALSE, highlight=FALSE }
-```{r snakemake-byoc-2021-bp-readme-tail, code=readLines("snakemake_best_practice/")[c(106:112)], eval=FALSE, highlight=FALSE, attr.source='startFrom="106"' }
-Use a test data set for test driven development of the workflow. It
-also gives a new user a quick idea of how to organize input files and
-## Dry-run the test suite
-```{bash snakemake-byoc-2021-dry-run }
-cd snakemake_best_practice/.test
-snakemake -s ../workflow/Snakefile -n -q -F
-Question: is there a way to validate configuration files, require
-inputs and make sure they conform to some predefined format?
-## Configuration schemas
-Schema benefits according to []():
-- describes your existing data formats
-- provides human- and machine-readable **documentation**
-- validates data input
-# Reports
-From snakemake 5.1 and on, generate detailed self-contained HTML
-reports that encompass runtime statistics, provenance information,
-workflow topology and results
-## The report directive
-```{python snakemake-report, code=readLines("snakemake_best_practice/workflow/Snakefile")[13:13], eval=FALSE, attr.source='startFrom="13"'}
-Workflow report template defined by `workflow/report/workflow.rst`.
-Use `report` flag to target results for inclusion in report, which
-could optionally point to an rst file for captioning.
-```{python r-plot-report, code=readLines("snakemake_best_practice/workflow/rules/qc.smk")[63:81], eval=FALSE, attr.source='startFrom="63"'}
-A linter is a code quality checker that analyzes your code and
-highlights issues that need to be resolved to follow best practices.
-```{r cd_to_snakemake_best_practice_2, echo=FALSE }
-```{bash snakemake-lint }
-snakemake --lint
-[snakefmt]( is an automated code
-formatter that should be applied to the workflow prior to publication.
-```{bash snakemake-fmt }
-snakefmt --compact-diff workflow/Snakefile
-```{r cd_back_2, echo=FALSE }
-## Pre-commit - for the git power user
-[Git hooks]( can
-be used to identify simple issues before submission to code review.
-[Pre-commit]( is
-a "framework for managing and maintaining multi-language pre-commit
-Install git hooks
-```{bash pre-commit, eval=FALSE}
-pre-commit install
-and see how many warnings you get when you try to commit!
-## Github actions for continuous integration
-[Snakemake github
-action]( allows
-running the test suite on github to make sure commits and pull
-requests don't break the workflow.
-## On project file structure vs workflow file structure
-Example from my config which is loosely modelled on the
-setup and similar to the NBIS reproducibility file structure:
-```{bash project-file-structure, cache=TRUE, echo=FALSE }
-tree -a -d -L 2 -I '.snakemake|.git' project
-Different snakemake workflows live in `opt` (see [File System Hierachy
-standard]( for choice of
-name). Launching from project root could then look like
-```{bash project-structure-launch, eval=FALSE}
-snakemake -s opt/datasources-smk/workflow/Snakefile -j 1
-# Questions?
+title: Best practices in detail
+subtitle: An overview of best practices, wrappers, schemas, report, config files, and more
+author: Per Unneberg
+date: "2 September, 2022"
+institute: NBIS
+from: markdown+emoji
+ revealjs:
+ theme:
+ - white
+ - ../custom.scss
+ self-contained: false
+ toc: false
+ toc-depth: 1
+ slide-level: 2
+ slide-number: true
+ preview-links: true
+ chalkboard: true
+ # Multiple logos not possible; would need to make custom logo combining both logos
+ footer: Snakemake BYOC 2022 - Best practices
+ logo:
+ smaller: true
+ highlight-style: gruvbox
+ fig-height: 3
+ fig-width: 3
+ echo: true
+ warning: false
+ cache: false
+ include: true
+ autodep: true
+ eval: true
+ error: true
+ opts_chunk:
+ code-fold: false
+ tidy: false
+ fig-format: svg
+## Setup {.unnumbered .unlisted}
+```{r }
+#| label: setup
+#| echo: false
+#| eval: true
+#| cache: false
+bw <- theme_bw(base_size=24) %+replace% theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
+snakemake_version <- system("snakemake --version", intern=TRUE)
+knitr::knit_hooks$set(inline = function(x) {
+ prettyNum(x, big.mark=",")
+ })
+- Examples based on more advanced snakefiles and code to run
+- All snakefiles and code are available in code repository
+ [](
+- Code has been run with Snakemake version `r snakemake_version`
+The best practice example workflow is a mapping and basic qc workflow
+where snakemake best practices have been applied.
+::: {.fragment}
+#### Objective
+The objective of the lecture is to provide an overview of some
+advanced features and how to structure your code. Hopefully it can
+give you some ideas for developing your workflow. The material can be
+a bit overwhelming so see it as a smörgåsbord where you can pick
+things to your liking.
+# Snakemake best practices
+## [Snakemake best practices summary](
+::: {.incremental}
+: Snakemake (>=5.11) comes with a code quality checker (a so called
+ linter). It is highly recommended to run the linter before
+ publishing any workflow, asking questions on Stack Overflow or
+ filing issues on Github.
+: There is an automatic formatter for Snakemake workflows, called
+ Snakefmt, which should be applied to any Snakemake workflow before
+ publishing it.
+: It is a good idea to add some minimal test data and configure Github
+ Actions for continuously testing the workflow on each new commit.
+: Stick to a standardized structure.
+: Configuration of a workflow should be handled via config files and,
+ if needed, tabular configuration like sample sheets (either via
+ Pandas or PEPs). Use such configuration for metadata and experiment
+ information, **not for runtime specific configuration like threads,
+ resources and output folders**. For those, just rely on Snakemake’s
+ CLI arguments like --set-threads, --set-resources,
+ --set-default-resources, and --directory.
+: Try to keep filenames short, but informative.
+Rules and functions
+: Try to keep Python code like helper functions separate from rules.
+: Make use of Snakemake wrappers whenever possible
+::: {.notes}
+- not necessary to follow these guidelines - suggestions
+- there is however a need to comply with format to publish workflow in
+ snakemake workflow collection
+- order of importance: structure, filenames > test > configuration > lint/format > wrappers
+- if only snakemake were python
+## A best practice repo - standardized structure
+Clone the repo (`git clone`) and list
+:::: {.columns}
+::: {.column width="40%"}
+::: {.column width="60%"}
+::: {.incremental}
+: Designated test directory containing a small data set which ideally
+ should suffice to run all or parts of the workflow. Useful for
+ test-driven development.
+: Describe what the workflow does and how to use it
+: Contains top-level `Snakefile` that includes rules files stored in
+ the `rules` sub-directory. NB: this is the main entry point to the
+ workflow.
+: conda environment files loaded by rules
+: notebooks that can be called by the workflow
+: workflow report templates
+: workflow rules
+: schema files that describe and define configuration file and data formats
+: scripts called by workflow
+::: {.notes}
+Emphasize that **structure** is one of the important aspects
+## What does it do?
+The repo should contain a README that briefly describes what the
+workflow does. Here are some excerpts:
+```{r code=readLines("../")[c(1:25)]}
+#| label: snakemake-byoc-2021-bp-readme
+#| eval: false
+#| highlight: false
+::: {.fragment}
+```{r code=readLines("../")[c(106:112)]}
+#| label: snakemake-byoc-2021-bp-readme-tail
+#| eval: false
+#| highlight: false
+#| attr-source: startFrom="106"
+::: {.fragment}
+Use a test data set for test driven development of the workflow. It
+also gives a new user a quick idea of how to organize input files and
+## Dry-run the test suite
+```{bash }
+#| label: snakemake-byoc-2021-dry-run-echo
+#| eval: false
+cd snakemake_best_practice/.test
+snakemake -s ../workflow/Snakefile -n -q -F
+```{bash }
+#| label: snakemake-byoc-2021-dry-run
+#| echo: false
+snakemake -s ../workflow/Snakefile -n -q -F
+## Draw the workflow
+```{bash }
+#| label: dry-run-fig-command
+#| echo: true
+#| eval: false
+snakemake -s ../workflow/Snakefile --rulegraph | dot | display
+```{bash }
+#| label: dry-run-fig
+#| fig-format: svg
+#| output: asis
+#| echo: false
+snakemake -s ../workflow/Snakefile --rulegraph | dot -T svg | grep -v ""
+tree -a -N -F -L 2 -I '.snakemake|LICENSE|.git|resources*references.fasta|resources*|Dockerfile|environment.yaml|.gitignore|.gitattributes|.editorconfig|.pre-commit-config.yaml|config*config.yaml|config*samples.tsv|config*reads.tsv|.ipynb_checkpoints|.myprofile|logs|reports|results|interim|*.~undo-tree~|*.png|*.zip|*.html|.github' ../../snakemake_best_practice | sed -z "s/\n/ \n/g;s/Snakefile/Snakefile<\/span>/;s/\.\.\/\.\.\///" | head -n -2
+echo ""
+::: {.notes}
+- explain pseudo-targets
+- point out the two common idioms for collecting targets:
+1. expand
+2. input functions
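The first idiom, `expand`, can be illustrated in plain Python. This is a minimal sketch that runs without Snakemake; `expand_sketch` is a hypothetical helper that only mimics the basic behaviour (every combination of the supplied wildcard values), not Snakemake's own implementation. The sample names are the ones used in the course examples.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Collect one BAM target per sample, as a pseudo-target rule would.
targets = expand_sketch("bam/{sample}.bam", sample=["CHS.HG00512", "PUR.HG00731"])
print(targets)
```

An input function serves the same purpose when the list of targets has to be computed, for example from a sample sheet, rather than written out as a pattern.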
+## Stuff common to all snakefiles
+::: {.absolute top=50 left=-200 }
+```{bash }
+#| label: snakemake-byoc-2022-common-margin-tree
+#| cache: false
+#| eval: true
+#| echo: false
+#| results: asis
+echo "
+tree -a -N -F -L 3 -I '.snakemake|LICENSE|.git|resources*references.fasta|resources*|Dockerfile|environment.yaml|.gitignore|.gitattributes|.editorconfig|.pre-commit-config.yaml|config*config.yaml|config*samples.tsv|config*reads.tsv|.ipynb_checkpoints|.myprofile|logs|reports|results|interim|*.~undo-tree~|*.png|*.zip|*.html|.github' -P "*.smk" ../../snakemake_best_practice | sed -z "s/\n/ \n/g;s/common.smk/common.smk<\/span>/;s/\.\.\/\.\.\///" | head -n -2
+echo "
+##### Usage #####
+Install git hooks
+```{bash }
+#| label: pre-commit
+#| eval: false
+pre-commit install
+and see how many warnings you get when you try to commit!
+## Github actions for continuous integration
+[Snakemake github
+action]( allows
+running the test suite on github to make sure commits and pull
+requests don't break the workflow.
+/* SCSS custom modifications for NBIS presentations using quarto, revealjs and rmarkdown */
+/*-- scss:defaults --*/
+/*-- scss:rules --*/
+.scroll-1000 {
+ max-height: 1000px;
+ overflow-y: auto;
+ background-color: inherit;
+.scroll-400 {
+ max-height: 400px;
+ overflow-y: auto;
+ background-color: inherit;
+.scroll-300 {
+ max-height: 300px;
+ overflow-y: auto;
+ background-color: inherit;
+.scroll-200 {
+ max-height: 200px;
+ overflow-y: auto;
+ background-color: inherit;
+pre.out {
+ background-color: lightgreen;
+pre.sourceCode.src {
+ background-color: lightblue;
+ { color: #85be42; font-weight: bold }
+code.tree {
+ line-height: 8px;
+ font-size: 18px;
+code.stree {
+ line-height: 6.6px;
+ font-size: 12px;
+code.sstree {
+ line-height: 5.6px;
+ font-size: 10px;
+code.large {
+ font-size: 20px;
+all: reproducibility-tools.html
+%.html: %.Rmd
+ Rscript -e 'rmarkdown::render("$<")'
+# OPENSSL_CONF due to
+%.pdf: %.html
+ OPENSSL_CONF=/dev/null Rscript -e 'library(webshot); webshot("$<", "$@")'
class: center, middle
-.HUGE[Combining Tools for Reproducible Research with Snakemake]
+.HUGE[Reproducible Research and Snakemake]
```{r Setup, echo = FALSE, message = FALSE}
# Knitr setup
@@ -33,436 +33,392 @@ library("kableExtra")
-# Reproducibility is rarer than you think
+# Reproducibility
-The results of only 26% out of 204 randomly selected papers in the journal
-*Science* could be reproduced.1
-.tiny[1 Stodden et. al (2018). "An empirical analysis of journal policy effectiveness for computational reproducibility". PNAS. 115 (11): 2584-2589]
+- Reproducible research is about being able to replicate the results of a study
+- It is an important aspect of the scientific method
+- **Computational reproducibility** is one part of it
+- Ideally, given the **same data** and the **same code**, there are identical outcomes
-> Many journals are revising author guidelines to include data and code
-> availability.
+*Code* encompasses
+- The workflow itself (→ `Snakefile`)
+- The helper scripts you are calling (→ `scripts/`)
+- The 3rd-party tools you are running/the execution environment (→ this lecture)
+# Computational reproducibility
+Why the effort?
+.tiny[M. Schwab et al. *Making scientific computations reproducible*.]
+> Because many researchers typically forget details
+> of their own work, they are not unlike strangers
+> when returning to projects after time away.
+> Thus, efforts to communicate your work to
+> strangers can actually help you communicate
+> with yourself over time.
-> (...) an improvement over no policy, but currently insufficient for
-> reproducibility.
+→ **You** are part of the target audience
-# Combining Tools for Reproducible Research with Snakemake
+# Don’t be *that* person[]
+*Science* implemented a replication policy in 2011.
+A study in 2018 requested raw data and code in accordance with the policy.
+Some answers:
+> When you approach a PI for the source codes and raw data, you better explain who you are,
+> whom you work for, why you need the data and what you are going to do with it.
-* Track your Snakemake code with .green[Git] and share it in a remote .green[repository] on GitHub or BitBucket (not covered in this lecture)
-* Combine Snakemake with .green[Conda] and/or .green[containers] to make the compute environment reproducible
+> I have to say that this is a very unusual request without any explanation!
+> Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.
-* Integrate foreign workflow management systems such as .green[Nextflow] pipelines into your Snakemake workflow
+(26% out of 204 randomly selected papers in the journal could be reproduced.)
+.tiny[Stodden et al. (2018). *An empirical analysis of journal policy effectiveness for computational reproducibility*]
-# Conda
+# Combine tools to make research reproducible
-* Is a .green[package, dependency, and environment] manager[]
- > packages: any type of program (_e.g._ bowtie2, snakemake etc.)
+* Track code changes over time with .green[Git] and share it on [GitHub]( (not this talk)
- > dependency: other software required by a package
- > environment: a distinct collection of packages
+* Make your workflow reproducible with a workflow manager (.green[Snakemake], .green[Nextflow], .green[WDL])
-* Keeps track of the dependencies between packages in each environment
+* Make the execution environment reproducible with .green[Conda] environments and/or .green[containers]
-# Conda
+# Conda: a .green[package], .green[dependency], and .green[environment] manager
-## 1. Running a Snakemake rule with a Conda environment
+* Conda installs packages
+* Packages come from a central repository at
+* Users can contribute their own packages via *channels*
+* Highly recommended: The [Bioconda]( channel
-* Make sure you have Conda .green[installed] (Miniconda or Anaconda)
+# Using Conda
+* Install Conda, for example with [Miniconda](
-* Find your Conda .green[package] on
+* Set up the [Bioconda]( channel
-* Create a Conda .green[environment file] (e.g. `bwa.yaml`)
-```{python conda env one, eval = FALSE}
- - conda-forge
- - bioconda
- - defaults
- - bwa=0.7.17
+* Install Samtools and BWA into a new **Conda environment** named `mapping`:
+```{bash, eval=FALSE}
+$ conda create -n mapping samtools bwa
-.tiny[source: [best practice example](]
+* Conda also installs all .green[dependencies] – other software required by Samtools and/or BWA.
-* Store your `yaml` files in a directory for environments
+To use the tools in the environment, .green[activate] it:
+```{bash, eval=FALSE}
+$ conda activate mapping
+$ samtools --version
+samtools 1.15.1
-* For reproducibility, it is important to keep include package .green[versions] in your environment file
+* Install a tool into an existing environment:
+```{bash, eval=FALSE}
+conda install -n mapping bowtie2
+(Leaving out `-n mapping` installs into the currently active environment.)
-# Conda
+# Conda environments
-## 1. Running a Snakemake rule with a Conda environment
-* Add the .green[path] to the Conda environment `yaml` file to your rule using `conda`
+* You can have as many environments as you wish
-```{python conda rule, eval = FALSE}
-rule map_bwa_index:
- output: expand("{{ref}}{ext}", ext=[".amb", ".ann", ".bwt", ".pac", ".sa"])
- input: config["ref"]
- log: "logs/bwa/index/{ref}.log"
- conda: "../envs/bwa.yaml"
- shell:
- "bwa index {input}"
-.tiny[modified from: [best practice example](]
+* Environments are independent
-* Start your workflow on the command line with `--use-conda`
+* If something is broken, simply delete the environment and start over
-```{bash snakemake use conda, eval=FALSE}
-$ snakemake --use-conda
+```{bash, eval=FALSE}
+$ conda env remove -n mapping
-* This doesn't work if you use `run` (instead of `shell` or `script`)
+* To test a new tool, install it into a fresh Conda environment. Delete the environment to uninstall.
+* Find packages by searching []( or with `conda search`
-# Conda
+# Conda environment files
-## 2. Using one Conda environment for the entire workflow
+* Conda environments can be created from .green[environment files] in YAML format.
-* Write a Conda .green[environment file] that includes all tools used by the workflow (save it as e.g. `environment.yaml`)
+* Example `bwa.yaml`:
-```{python conda env big, eval=FALSE}
-name: best-practice-smk
+```{yaml conda env one, eval = FALSE}
- conda-forge
- bioconda
- - default
+ - defaults
- - snakemake=6.8.0
- - python=3.8
- - pandas=1.3.3
- - jupyter=1.0
- - jupyter_contrib_nbextensions=0.5.1
- - jupyterlab_code_formatter=1.4
- bwa=0.7.17
- - multiqc=1.11
- - r-ggplot2=3.3.5
- - samtools=1.13
-.tiny[source: [best practice example](]
+* Create the environment:
+```{bash, eval = FALSE}
+$ conda env create -n bwa -f bwa.yaml
-# Conda
+# Snakemake + Conda
-## 2. Using one Conda environment for the entire workflow
+## Option one: A single environment for the entire workflow
-* .green[Create] the environment
-```{bash conda create, eval=FALSE}
-$ conda env create -f environment.yml
+* Write an environment file (`environment.yaml`) that includes .green[all tools used by the workflow]:
+```{python conda env big, eval=FALSE}
+name: best-practice-smk
+ - conda-forge
+ - bioconda
+ - defaults
+ - snakemake=6.8.0 # ← Snakemake is part of the environment
+ - multiqc=1.11 # ← Version numbers for reproducibility
+ - samtools=1.13
-* .green[Activate] your Conda environment
-```{bash conda activate, eval=FALSE}
+* Create the environment, activate it and run the workflow within it:
+```{bash snakemake conda env, eval=FALSE}
+$ conda env create -f environment.yaml
$ conda activate best-practice-smk
+$ snakemake
+* Possibly helpful: `conda env export -n envname > environment.yaml`
-* Start your Snakemake workflow
-```{bash snakemake conda env, eval=FALSE}
-(best-practice-smk) [...] $ snakemake
+.tiny[source: [best practice example](]
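Since version numbers matter for reproducibility, it can be handy to check that every dependency is pinned. A minimal sketch, assuming the dependencies have already been read into a list of strings; `unpinned` is a hypothetical helper, and the unpinned `bwa` entry is added here purely for illustration.

```python
def unpinned(dependencies):
    """Return the dependency specs that lack an '=' version pin."""
    return [dep for dep in dependencies if "=" not in dep]


# Three pinned entries from the environment file plus one unpinned
# entry ("bwa") that is made up for this example.
deps = ["snakemake=6.8.0", "multiqc=1.11", "samtools=1.13", "bwa"]
missing = unpinned(deps)
print(missing)
```

Anything reported by such a check may resolve to a different version the next time the environment is created.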
+# Snakemake + Conda
-# Containers
+## Option two: Rule-specific environments
-## What can I use containers for?
+You can let Snakemake create and activate Conda environments for you.
-* Run applications securely .green[isolated] in a container, packaged with .green[all dependencies and libraries]
+1. Create the environment file, such as `envs/bwa.yaml` (`envs/` is best practice)
-* As advanced .green[environment manager]
-* To package your .green[code] with the environment it needs
-* To package a whole .green[workflow] (*e.g.* to accompany a manuscript)
+1. Add the `conda:` directive to the rule:
+```{python conda rule, eval = FALSE}
+rule create_bwa_index:
+ output: ...
+ input: ...
+ conda: "envs/bwa.yaml" # ← Path to environment YAML file
+ shell:
+ "bwa index {input}"
-* And much more
+1. Run `snakemake --use-conda`
-## Docker vs. Singularity
+* Snakemake creates the environment for you and re-uses it next time
+* If the YAML file changes, the environment is re-created
+* `conda:` does not work if you use `run:` (instead of `shell:` or `script:`)
-* Docker was developed for .green[any operating system] except high-performance computing (HPC) clusters
+.tiny[modified from: [best practice example](]
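Why is an environment re-used until the YAML file changes? One way to picture it is a cache keyed on the file content. The sketch below is an illustration under that assumption, not Snakemake's actual code; `env_key`, `get_env` and the edited version number are hypothetical.

```python
import hashlib

cache = {}  # maps content hash -> name of the "created" environment


def env_key(env_yaml_text):
    """Hypothetical cache key: a short hash of the environment file content."""
    return hashlib.sha256(env_yaml_text.encode("utf-8")).hexdigest()[:8]


def get_env(env_yaml_text):
    """Return (environment name, whether it had to be created)."""
    key = env_key(env_yaml_text)
    created = key not in cache
    if created:
        cache[key] = f"env-{key}"  # stand-in for the slow conda install
    return cache[key], created


original = "channels:\n  - bioconda\ndependencies:\n  - bwa=0.7.17\n"
edited = original.replace("bwa=0.7.17", "bwa=0.7.18")  # hypothetical edit

first = get_env(original)   # created on first use
second = get_env(original)  # same content, so re-used
third = get_env(edited)     # content changed, so created anew
print(first[1], second[1], third[1])
```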
-* Singularity is an open source container platform suitable for .green[HPC clusters]
-# Containers
-## Docker nomenclature
+# Using a "module" system
+* Conda environments can be large and slow to create
-* A Docker .green[file] is a recipe used to build a Docker .green[image]
+* Some cluster operators frown upon using it
-* A Docker .green[image] is a standalone executable package of software
-* A Docker .green[container] is a standard unit of software run on the Docker Engine
+* UPPMAX and other clusters have a .green[module] command for getting access to software:
+$ module load bioinfo-tools bwa
-* .green[DockerHub] is an online service for sharing Docker images
+* Snakemake supports this with the `envmodules:` directive:
+```{bash, eval = FALSE}
+rule create_bwa_index:
+ output: ...
+ input: ...
+ envmodules:
+ "bioinfo-tools",
+ "bwa",
+ conda: "envs/bwa.yaml" # ← Fallback
+ shell:
+ "bwa index {input}"
+* Run with `snakemake --use-envmodules`
-* Docker images can be converted into Singularity images
+* For reproducibility, [the Snakemake documentation recommends]( to also include a `conda:` section
# Containers
-## 1. Running Snakemake rules with Singularity
+* Containers represent another way of packaging applications
-* Snakemake can run a rule .green[isolated] in a container, using Singularity
+* Each container contains the application itself and .green[all system-level dependencies and libraries] (that is, a functional Linux installation)
-* All Conda packages are available as Docker and Singularity images, _e.g._ on (bioconda channel)
+* It is fully .green[isolated] from the other software on the machine:
+ By default, the tools in the container can only access what is in the container.
-* Many other Docker images are available on [DockerHub](
-* Or build your own Docker or Singularity images
+* The most common software for managing containers is .green[Docker]
# Containers
-## 1. Running Snakemake rules with Singularity
+## Docker nomenclature
-* Make sure your system has Singularity .green[installed]
+* A Docker .green[image] is a standalone executable package of software (on disk)
-* Find the Docker or Singularity .green[image] in which you want to run the rule
+* A .green[Dockerfile] is a recipe used to build a Docker .green[image]
-* Add the .green[link] to the container image (or the path to a Singularity `*.sif` file) to your rule using the `container` directive
-```{python singularity rule, eval = FALSE}
-rule NAME:
- input:
- "table.txt"
- output:
- "plots/myplot.pdf"
- container:
- "docker://joseespinosa/docker-r-ggplot2"
- script:
- "scripts/plot-stuff.R"
-.tiny[source: [Snakemake documentation](]
+* A Docker .green[container] is a standard unit of software run on the Docker Engine
+ (running an image gives a container)
-* Start your workflow on the command line with `--use-singularity`
-```{bash snakemake use singularity, eval=FALSE}
-$ snakemake --use-singularity
-# Containers
-## 2. Packaging your Snakemake workflow in a Docker container
+* .green[DockerHub] is an online service for sharing Docker images
-* Make sure your system has Docker .green[installed]
+## Docker vs Singularity
-* Write a .green[Docker file], _e.g._ [see this example](
+* On high-performance clusters (HPC), Docker is often not installed due to security concerns.
+ .green[Singularity] is often available as an alternative.
+* Docker images can be converted into Singularity images
- * Start with the official `Ubuntu` image
- * Install Miniconda and other required tools (_e.g._ Snakemake)
- * Add the project files (e.g. `Snakefile`, `config.yaml`, `environment.yaml`)
- * Install the Conda environment containing all packages run by the workflow
+* → Singularity can be used to run Docker containers
-# Containers
+# Running Snakemake jobs in containers
-## 2. Packaging your Snakemake workflow in a Docker container
+Snakemake can run jobs in a container using Singularity
-* Create a Docker .green[image] from your Docker file (_e.g._ called `my_workflow`)
-```{bash docker image, eval=FALSE}
-$ docker build -t my_workflow .
+* Ensure your system has Singularity installed
-* .green[Run] your container, _e.g._
-```{bash docker run, eval=FALSE}
-$ docker run my_workflow
+* Find a Docker or Singularity image with the tool to run ( or [DockerHub](
-* .green[Share] your Docker file on GitHub or BitBucket, or your Docker image on DockerHub
-# Combinations of Conda and Containers
+* Add the `container:` directive to your rule:
-## Combine Conda-based package management with running jobs in containers
+```{python singularity rule, eval = FALSE}
+rule minimap2_version:
+ container: "docker://" # ← "docker://" is needed
+ shell:
+ "minimap2 --version"
-* A container can be specified globally (for the entire workflow) for a workflow
- with rule-specific Conda environments
-* Snakemake then runs each job in this container with its corresponding Conda
- environment when run with `--use-conda --use-singularity`
+* Start your workflow on the command line with `--use-singularity`
-.tiny[More info: [Snakemake documentation]( & [best practice example](]
+```{bash snakemake use singularity, eval=FALSE}
+$ snakemake --use-singularity -j 1
+Pulling singularity image docker://
+Activating singularity image .../.snakemake/singularity/342e6ddbac7e5929a11e6ae9350454c0.simg
+INFO: Converting SIF file to temporary sandbox...
+INFO: Cleaning up image...
-# Combinations of Conda and Containers
+# Containers – advanced topics
-## Containerization of Conda-based workflows
+* A [Docker image to use for *all* rules can be specified](
-* Snakemake can automatically generate a Docker file that contains all
- Conda environments used by the rules of the workflow using the flag `--containerize`
-.tiny[More info: [Snakemake documentation](]
-# Integrating foreign workflow management systems
-* From version 6.2 on, Snakemake can run workflows written in other workflow
- management systems such as .green[Nextflow]
+* You can package your entire workflow into a Docker image by writing a .green[Dockerfile].
+ [See this example](
+ - Snakemake runs *inside* the container.
+ - To run the workflow, only Docker or Singularity is needed
-* The workflow runs in .green[Snakemake] until a rule to run the foreign workflow is reached
-* In this rule, Snakemake .green[hands over] to the other workflow manager
-* Afterwards, .green[Snakemake] continues to run rules processing the output files of the foreign workflow
+* [Conda and containers can be combined]([Snakemake documentation]( Specify a global container, run with `--use-conda --use-singularity`, and Snakemake creates the Conda environment within the container.
-```{python nextflow, eval = FALSE}
-rule chipseq_pipeline:
- input:
- input="design.csv",
- fasta="data/genome.fasta",
- gtf="data/genome.gtf",
- output:
- "multiqc/broadPeaks/multiqc_report.html",
- params:
- pipeline="nf-core/chipseq",
- revision="1.2.1",
- profile=["conda"],
- handover: True
- wrapper:
- "0.74.0/utils/nextflow"
-.tiny[More info & source: [Snakemake documentation](]
+* [Snakemake can automatically generate a Dockerfile](
+ that contains all Conda environments used by the rules of the workflow using the flag
+ `--containerize`.
@@ -472,7 +428,7 @@ There are many ways to use other .green[tools for reproducible research] togethe
-* Use .green[Git] to version control, backup and share your code
+* Use .green[Git] for version control, backup and sharing of your code
@@ -480,20 +436,20 @@ There are many ways to use other .green[tools for reproducible research] togethe
-* Run your rules in isolated Singularity .green[containers]
+* Run your rules in isolated Docker/Singularity .green[containers]
* Package your entire workflow in a .green[Docker container]
-* Run pipelines written in .green[other workflow management systems] in your Snakemake workflow
diff --git a/lectures/reproducibility-tools/reproducibility-tools.pdf b/lectures/reproducibility-tools/reproducibility-tools.pdf
index fe8990c..c2fa933 100644
+rule bwa_mem_CHS_HG00512:
+ output:
+ "bam/CHS.HG00512.bam"
+ input:
+ "resources/ref.fa",
+ "data/CHS.HG00512_1.fastq.gz",
+ "data/CHS.HG00512_2.fastq.gz",
+ shell:
+ "bwa mem -t 1 {input}"
+rule all:
+ input: expand("bam/{sample}.bam", sample=["CHS.HG00512", "PUR.HG00731"])
index 0000000..221bccd
--- /dev/null
+++ b/lectures/running_snakemake/running.qmd
@@ -0,0 +1,1067 @@
+title: Running snakemake
+subtitle: Running snakemake locally and on the cluster, fine-tuning performance and setting resource usage
+author: Per Unneberg
+date: "1 September, 2022"
+institute: NBIS
+from: markdown+emoji
+ revealjs:
+ theme:
+ - white
+ - ../custom.scss
+# css: ../revealjs.css
+ self-contained: false
+ toc: true
+ toc-depth: 1
+ slide-level: 2
+ slide-number: true
+ preview-links: true
+ chalkboard: true
+ # Multiple logos not possible; would need to make custom logo combining both logos
+ footer: Snakemake BYOC 2022 - Running snakemake
+ logo:
+ smaller: true
+ highlight-style: gruvbox
+ fig-height: 3
+ fig-width: 3
+ echo: true
+ warning: false
+ cache: false
+ include: true
+ autodep: true
+ eval: true
+ error: true
+ opts_chunk:
+ code-fold: false
+ tidy: false
+ fig-format: svg
+## Setup {.unnumbered .unlisted}
+```{r libs }
+#| echo: false
+#| eval: true
+#| cache: false
+# For some reason this is not applied to print statements
+## knitr::knit_hooks$set(inline = function(x) {
+## prettyNum(x, big.mark=",")
+## })
+bw <- theme_bw(base_size=24) %+replace% theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
+## knitr::knit_engines$set(snakemake = function(options) {
+## reticulate::eng_python(options)
+## })
+snakemake_version <- system("snakemake --version", intern=TRUE)
+- Examples with snakefiles and code to run in `examples` subdirectory
+- Snakefiles are named `ex#.smk`
+- Code has been run with Snakemake version `r snakemake_version`
+- Rules run `bwa mem` to map two samples to a reference
+:::: {.columns}
+::: {.column width="50%"}
+::: {.fragment}
+Input data:
+```{bash }
+#| echo: false
+#| label: list-data-0
+tree -L 3 data resources
+::: {.column width="50%"}
+::: {.fragment}
+```{bash }
+#| label: list-snakefiles
+#| echo: false
+ls -1 *.smk
+# Basic execution
+## Example 1 - ex1.smk
+Let's start by writing a snakefile that runs an alignment with `bwa`. The command we want to run is
+```{bash }
+#| label: bwa-command
+#| echo: true
+#| eval: false
+bwa mem -t 1 resources/ref.fa data/CHS.HG00512_1.fastq.gz data/CHS.HG00512_2.fastq.gz | samtools view -b -o bam/CHS.HG00512.bam
+where we have
+::: {.incremental}
+: `bam/CHS.HG00512.bam`
+: `resources/ref.fa`, `data/CHS.HG00512_1.fastq.gz`, and `data/CHS.HG00512_2.fastq.gz`
+: `bwa mem -t 1 {input} | samtools view -b -o {output}`
+::: {.fragment}
+Putting these in a snakefile yields:
+``` {python code=readLines("ex1.smk") }
+#| eval: false
+#| label: bwa_mem_CHS_HG00512-1
+## Example 1 - ex1.smk
+We first perform a dry run (`--dry-run/-n`) that prints the shell
+commands (`--printshellcmds/-p`) for the default rule in the snakefile
+`ex1.smk`, which we point to using option `--snakefile/-s`:
+::: {.fragment}
+```{bash }
+#| label: snakemake-bwa_mem_CHS_HG00512_f1-1
+#| eval: false
+snakemake --snakefile ex1.smk --dry-run --printshellcmds
+## Example 1 - ex1.smk
+We first perform a dry run (`--dry-run/-n`) that prints the shell
+commands (`--printshellcmds/-p`) for the default rule in the snakefile
+`ex1.smk`, which we point to using option `--snakefile/-s`:
+```{bash }
+#| label: snakemake-bwa_mem_CHS_HG00512_f1-2
+#| eval: false
+snakemake -s ex1.smk -n -p
+::: {.fragment}
+```{bash }
+#| label: snakemake-bwa_mem_CHS_HG00512_f2
+#| echo: false
+rm -f bam/CHS.HG00512.bam
+snakemake -s ex1.smk -n -p
+## Example 1 - ex1.smk
+We first perform a dry run (`--dry-run/-n`) that prints the shell
+commands (`--printshellcmds/-p`) for the default rule in the snakefile
+`ex1.smk`, which we point to using option `--snakefile/-s`:
+```{bash }
+#| label: snakemake-bwa_mem_CHS_HG00512_f1-3
+#| eval: false
+snakemake -s ex1.smk -n -p
+```{bash }
+#| label: snakemake-bwa_mem_CHS_HG00512_f3
+#| echo: false
+rm -f bam/CHS.HG00512.bam
+snakemake -s ex1.smk -n -p
+Note the reason the rule was run[^reason]
+[^reason]: For snakemake < 7.12, use the `--reason/-r` option.
+::: {.notes}
+Mention the reason the rule was rerun
+## Example 1 - ex1.smk
+To actually run the workflow, we simply drop the `-p` and `-n` flags
+and add the number of cores (`--cores/-c`)[^cores] we want to utilize:
+```{bash }
+#| label: snakemake-rerun-workflow
+#| cache: true
+#| eval: false
+snakemake -s ex1.smk -c 1
+```{bash }
+#| label: snakemake-rerun-workflow-rm
+#| cache: true
+#| echo: false
+rm -f bam/*.bam
+snakemake -s ex1.smk -c 1
+[^cores]: Required for snakemake >= 5.11.0
+## Example 2 - ex2.smk
+The current snakefile consists of a single rule that is hard-coded for
+the `CHS.HG00512` sample. First, let's generalize the bwa rule
+using wildcards:
+``` {python code=readLines("ex2.smk") }
+#| label: bwa_mem_wildcard
+#| eval: false
+::: {.fragment}
+Now, running snakemake as before results in an error:
+::: {.fragment}
+``` {bash }
+#| eval: true
+#| echo: true
+snakemake -s ex2.smk -n -p
+::: {.fragment}
+As the error indicates, we could specify a target (e.g.
+`bam/PUR.HG00731.bam`; note the use of the `--quiet/-q` option to
+suppress information):
+``` {bash }
+snakemake -s ex2.smk -q -n bam/PUR.HG00731.bam
+::: {.fragment}
+Alternatively - and better - add a *pseudo-rule* (typically called
+`all`) at the *top* of the snakefile, since if no target is provided
+at the command line, snakemake will use the first rule it encounters.
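+
+As a minimal sketch of such a layout (an assumption for illustration,
+not necessarily identical to `ex3.smk`), a pseudo-rule `all` that
+requests both bam files, placed above the wildcard rule, could look
+like this:
+
+```{python }
+#| label: sketch-pseudo-rule-all
+#| eval: false
+# Pseudo-rule: no output or command, only the final targets as input
+rule all:
+    input:
+        expand("bam/{sample}.bam", sample=["CHS.HG00512", "PUR.HG00731"]),
+
+
+# The wildcard {sample} is filled in from the requested target name
+rule bwa_mem_wildcard:
+    output:
+        "bam/{sample}.bam",
+    input:
+        "resources/ref.fa",
+        "data/{sample}_1.fastq.gz",
+        "data/{sample}_2.fastq.gz",
+    shell:
+        "bwa mem -t 1 {input} | samtools view -b -o {output}"
+```
+
+Because `all` comes first, snakemake resolves both samples without a
+target being given on the command line.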
+## Example 3 - ex3.smk
+With this in mind, the new snakefile becomes:
+``` {python code=readLines("ex3.smk") }
+#| label: bwa_mem_wildcard_all
+#| eval: false
+#| code-line-numbers: "|1-2"
+::: {.fragment}
+Running the default target (implicitly `all`) gives:
+``` {bash }
+#| class-output: "scroll-300"
+snakemake -s ex3.smk -n -p
+::: {.fragment}
+Note that since sample `CHS.HG00512` had been processed previously,
+only one job is run.
+## Example 3 - ex3.smk: forcing reruns
+::: {.fragment}
+One can force regeneration of a single target with the
+`--force/-f` option:
+```{bash }
+#| label: snakemake-force-rerun-target
+#| output-location: fragment
+#| class-output: "scroll-200"
+snakemake -s ex3.smk -f -c 1 -p -n bam/CHS.HG00512.bam
+::: {.fragment}
+To rerun the entire workflow use `--forceall/-F`:
+``` {bash }
+#| label: snakemake-force-rerun-all
+#| output-location: fragment
+#| class-output: "scroll-200"
+snakemake -s ex3.smk -F -c 1 -q -n
+::: {.fragment}
+::: {.callout-tip}
+Always combine `--forceall` with `--dry-run` first so as not
+to inadvertently rerun everything from scratch.
+## Example 3 - ex3.smk: other handy options
+##### Rerun-incomplete and keep-going
+When resuming a workflow it can be handy to add the
+`--rerun-incomplete/--ri` and `--keep-going/-k` options:
+```{bash }
+#| label: snakemake-rerun-incomplete-workflow
+#| eval: false
+snakemake -s ex3.smk -c 2 --ri -k
+`--rerun-incomplete` reruns jobs whose output is recognized as
+incomplete, e.g. after slurm timeouts. If a job fails, `--keep-going`
+will still finish as many independent jobs as possible before
+terminating the workflow.
+::: {.fragment}
+##### Printing the workflow dag and rulegraph
+:::: {.columns}
+::: {.column width="30%"}
+::: {.fragment}
+`--rulegraph` is a convenient way of getting an overview of the workflow
+``` {bash }
+#| eval: false
+snakemake -s ex3.smk --rulegraph | dot | display
+``` {bash }
+#| fig-format: svg
+#| output: asis
+#| echo: false
+ snakemake -s ex3.smk --rulegraph | dot -T svg | grep -v "
+::: {.notes}
+Point out that this can cause jobs to fail if they exceed the specified resources
+## Adding threads and resources to workflow - ex7.smk ##
+Since we will be fine-tuning threads and resources per rule, we add
+another rule, `samtools_merge_bam`, that merges our bam files, add the
+`threads` keyword, and set resources for one of the rules:
+``` {python code=readLines("ex7.smk") }
+#| label: workflow-adding-threads
+#| echo: true
+#| eval: false
+#| code-line-numbers: "|12,24|13,25|1,2|21-23|26,27"
+::: {.fragment}
+```{bash }
+#| label: ex7-run
+#| echo: true
+#| eval: true
+snakemake -s ex7.smk -n -q -F -c 10
+::: {.fragment}
+Note that we changed the final pseudo-target name since the final
+workflow output now is a merged bam file!
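+
+For orientation, a rule that declares `threads` and `resources` could
+look like the following sketch. The numbers are hypothetical and not
+taken from `ex7.smk`:
+
+```{python }
+#| label: sketch-threads-resources
+#| eval: false
+rule bwa_mem_wildcard:
+    output:
+        "bam/{sample}.bam",
+    input:
+        "resources/ref.fa",
+        "data/{sample}_1.fastq.gz",
+        "data/{sample}_2.fastq.gz",
+    # Hypothetical value; capped by the cores given with --cores
+    threads: 2
+    # Hypothetical values; mem_mb in megabytes, runtime in minutes
+    resources:
+        mem_mb=4000,
+        runtime=60,
+    shell:
+        "bwa mem -t {threads} {input} | samtools view -b -o {output}"
+```
+
+Using `{threads}` in the shell command keeps the number of threads
+passed to `bwa mem` in sync with what snakemake reserves for the job.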
+## Setting rule-specific resources ##
+Default resources are one-size-fits-all settings that apply to
+all rules. However, in many workflows, certain rules need
+resources that differ from the defaults.
+Rule-specific resources can be set with the `--set-resources` option.
+Similarly, `--set-threads` allows setting rule-specific thread values:
+::: {.fragment}
+```{bash }
+#| label: snakemake-set-resources
+snakemake -F -n -s ex7.smk --default-resources mem_mb=2000 --set-resources bwa_mem_wildcard:runtime=1000 \
+ bwa_mem_wildcard:mem_mb=6000 --set-threads bwa_mem_wildcard=4 -c 8
+## Putting it all together - on the limits of the command line
+Putting everything together, we could now have a command line that
+looks something like this:
+::: {.fragment}
+```{bash }
+#| label: snakemake-long-command-line
+#| eval: false
+snakemake -s ex7.smk -F --ri -k \
+ --use-conda --use-singularity --use-envmodules \
+ --default-resources mem_mb=2000 --set-resources bwa_mem_wildcard:runtime=1000 \
+ bwa_mem_wildcard:mem_mb=6000 samtools_merge_bam:runtime=100 \
+ --set-threads bwa_mem_wildcard=4 -c 8
+::: {.fragment}
+This is getting illegible and it is tedious to write. What to do?
+::: {.fragment}
+Snakemake profiles to the rescue!
+# Snakemake profiles #
+## About ##
+Profiles are configuration files that apply to specific compute
+environments and analyses. They allow setting default options.
+::: {.fragment}
+At its simplest, a profile is a directory with a `config.yaml`
+file that sets program options. Let's put our previous example in a
+directory called `local` to represent a local profile. The minimum
+content of that directory is then a file `config.yaml` with (in this
+case) the following contents:
+```{r code=readLines("local/config.yaml") }
+#| label: snakemake-local-profile
+#| eval: false
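+
+As a sketch of what such a file might contain (an assumption, since
+the contents are read from `local/config.yaml` at render time), the
+options of the long command line above map onto keys named after the
+long options without the leading dashes:
+
+```yaml
+# Hypothetical local/config.yaml, assuming Snakemake 7.x
+rerun-incomplete: true
+keep-going: true
+use-conda: true
+use-singularity: true
+use-envmodules: true
+default-resources:
+  - mem_mb=2000
+set-resources:
+  - bwa_mem_wildcard:runtime=1000
+  - bwa_mem_wildcard:mem_mb=6000
+  - samtools_merge_bam:runtime=100
+set-threads:
+  - bwa_mem_wildcard=4
+```
+
+The list syntax for `set-resources` and `set-threads` is assumed to
+mirror the command-line format; check the documentation of your
+snakemake version before relying on it.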
+## Running the profile ##
+Run with `--profile` (NB: the profile argument can also be an absolute or
+relative path):
+```{bash }
+#| label: snakemake-local-profile-run
+snakemake -s ex7.smk --profile local -n -p -F -c 8
+# Cluster execution
+## Working on uppmax ##
+So far we have looked at local jobs. What if we want to submit jobs to
+an HPC cluster? Here we focus on SLURM.
+::: {.fragment}
+##### sbatch solution #####
+Wrap workflow in sbatch script (e.g.
+```{bash }
+#| label: sbatch-script
+#| eval: false
+#!/bin/bash -l
+#SBATCH -A account
+#SBATCH -p core
+#SBATCH -c 20
+#... other SBATCH arguments ...
+module load snakemake
+snakemake -j 20 --use-conda --use-envmodules all
+and submit with
+```{bash }
+#| label: sbatch-submit
+#| eval: false
+::: {.fragment}
+Downside: can only make use of one node at a time.
+## The snakemake job scheduler
+When running jobs locally using a limited number of threads, snakemake
+needs to decide which job to run when. These decisions are made by an
+internal *job scheduler*. As we will see, the internal scheduler still
+has this role when submitting jobs to a cluster scheduler.
+::: {.fragment}
+##### Background sessions #####
+A workflow can take a long time to run. A workflow started in a
+login shell will terminate once we log out. Therefore, it is advisable to
+start a workflow in a *background session*, using a so-called *terminal
+multiplexer* such as either
+[screen]( or
+A named `tmux` session can be initiated as
+```{bash }
+#| label: tmux
+#| eval: false
+tmux new -s mysession
+Inside the session, use a prefix (default `Ctrl-b`; many change to
+`Ctrl-a`, which is the default in `screen`) followed by a key to launch `tmux`
+commands. For instance, `Ctrl-b d` will detach from (leave) the
+session. See the [tmux
+for further info.
+## Generic execution
+The `--cluster` option can be used to submit jobs to the cluster:
+```{bash }
+#| label: snakemake-cluster-generic
+#| eval: false
+snakemake --cluster "sbatch -A account -p core -n 20 -t {resources.runtime}" \
+ --use-conda --use-envmodules --ri -k --default-resources runtime=100 -j 100
+Note the use of the format string `{resources.runtime}` to set running
+times individually.
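+
+To illustrate, assume a rule that requests its own `runtime` (the
+value and the output name below are hypothetical). Jobs of that rule
+would be submitted with `-t 30`, whereas rules without a `runtime`
+fall back to the `--default-resources` value of 100:
+
+```{python }
+#| label: sketch-rule-runtime
+#| eval: false
+rule samtools_merge_bam:
+    output:
+        "merged/all.bam",  # hypothetical output name
+    input:
+        expand("bam/{sample}.bam", sample=["CHS.HG00512", "PUR.HG00731"]),
+    resources:
+        runtime=30,  # minutes; substituted for {resources.runtime}
+    shell:
+        "samtools merge {output} {input}"
+```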
+One drawback with this approach is that failed jobs or timeouts go
+undetected, which means you have to monitor the outputs regularly. You
+don't want to do that.
+::: {.fragment}
+##### Custom cluster commands #####
+The argument to `--cluster` is a command (sbatch in example above), so
+could be any wrapper script that submits jobs to a cluster scheduler.
+::: {.fragment}
+Furthermore, option `--cluster-status` takes as argument a command
+(i.e. custom script) that checks jobs for their status.
+::: {.fragment}
+Also, option `--jobscript` takes as argument a custom job script
+template that is used when submitting jobs to the cluster.
+::: {.fragment}
+We could write custom scripts for each of these options to fine-tune
+job submission. If only there were such scripts already available!
+## snakemake-profiles
+[Snakemake Profiles]( are
+collections of reusable configuration profiles for various computing
+environments. The [slurm snakemake
+profile]( provides the
+scripts we requested on the previous slide.
+::: {.fragment}
+##### Installation #####
+The profiles are [cookiecutter
+templates]( and can be
+installed as follows:
+:::: {.columns}
+::: {.column width="50%"}
+::: {.fragment}
+```{bash }
+#| label: cookiecutter-profile-install
+#| eval: false
+$ cookiecutter
+profile_name [slurm]: myprofile
+Select use_singularity:
+1 - False
+2 - True
+Choose from 1, 2 [1]:
+Select use_conda:
+1 - False
+2 - True
+Choose from 1, 2 [1]:
+jobs [500]:
+restart_times [0]:
+max_status_checks_per_second [10]:
+max_jobs_per_second [10]:
+latency_wait [5]:
+Select print_shell_commands:
+1 - False
+2 - True
+Choose from 1, 2 [1]:
+sbatch_defaults []: --account=account
+cluster_sidecar_help [Use cluster sidecar. NB! Requires snakemake >= 7.0! Enter
+to continue...]:
+Select cluster_sidecar:
+1 - yes
+2 - no
+Choose from 1, 2 [1]:
+cluster_name []:
+cluster_jobname [%r_%w]:
+cluster_logpath [logs/slurm/%r/%j-%w]:
+cluster_config_help [The use of cluster-config is discouraged. Rather, set snakemake CLI options in the profile configuration file (see snakemake documentation
+on best practices). Enter to continue...]:
+cluster_config []:
+::: {.column width="50%"}
+::: {.fragment}
+Profile directory contents:
+```{bash }
+#| label: slurm-profile-tree
+#| echo: false
+tree myprofile | head -n -2
+## snakemake slurm profile
+:::: {.columns}
+::: {.column width="50%"}
+```{r code=readLines("myprofile/settings.json") }
+#| filename: "myprofile/settings.json"
+#| label: slurm-settings
+#| eval: false
+#| cache: false
+```{r code=readLines("myprofile/config.yaml") }
+#| filename: "myprofile/config.yaml"
+#| label: slurm-config
+#| eval: false
+::: {.column width="50%"}
+```{python code=readLines("myprofile/")[4:39]}
+#| filename: "myprofile/"
+#| label: slurm-cookiecutter-class
+#| eval: false
+::: {.fragment}
+##### Job submission #####
+Submit jobs with
+```{bash }
+#| label: slurm-profile-submit
+#| eval: false
+snakemake -s ex7.smk --profile myprofile -j 10 --ri -k -F
+## New features - time formatting and sbatch parameters
+Previously, one could set e.g. `partition` in a rule:
+```{python }
+#| label: set-partition-in-rule
+#| echo: true
+#| eval: false
+rule bwa_mem:
+    resources:
+        time = "00:10:00",
+        mem = 12000,
+        partition = "devel"
+::: {.fragment}
+However, in many cases you would like to *constrain* jobs to node features
+defined on the HPC with the SLURM `--constraint` option. For example,
+UPPMAX defines the following features:
+```{bash }
+#| label: uppmax-sinfo-features
+#| echo: true
+#| eval: false
+sinfo -e -o "%P %m %c %.5a %f" | grep "ibsw2,\|ibsw16\|PARTITION" | grep "PARTITION\|node"
+```{bash }
+#| label: uppmax-sinfo-features-results
+#| echo: true
+#| eval: false
+node 256000 20 up fat,mem256GB,ibsw2,usage_mail
+node 128000 20 up thin,mem128GB,ibsw2,usage_mail
+node 1000000 20 up fat,mem1TB,mem256GB,mem512GB,ibsw16,usage_mail
+node 128000 20 up thin,mem128GB,ibsw16,usage_mail
+::: {.fragment}
+With the latest version of the slurm profile, you can do the following:
+```{python }
+#| label: set-constraint-in-rule
+#| echo: true
+#| eval: false
+rule gpu_stuff:
+    resources:
+        time="12h30m",
+        partition="node",
+        slurm="constraint=fat qos=gpuqos gres=gpu:2 gpus=4"
+# Questions? {.unnumbered .unlisted}