diff --git a/lectures/anatomy-of-a-rule/anatomy.html b/lectures/anatomy-of-a-rule/anatomy.html index 4bdadcc..89ef961 100644 --- a/lectures/anatomy-of-a-rule/anatomy.html +++ b/lectures/anatomy-of-a-rule/anatomy.html @@ -1,652 +1,2910 @@ - - - Anatomy of a Snakefile - - - - - - - - - - + + + + + + + + + + + Anatomy of a Snakefile + + + + + + + + + + + + + + + +
+
+ +
+

Anatomy of a Snakefile

+

Snakemake BYOC NBIS course

+ +
+
+ +

2024-05-27

+
+
+

Basic structure of a rule

+
rule:
+    output: "results/sample1.stats.txt"
+    shell:
+        """
+        echo -e "sample1\t50%" > {output}
+        """
+
+
+

Basic structure of a rule

+
$ snakemake -c 1
+Assuming unrestricted shared filesystem usage.
+Building DAG of jobs...
+Using shell: /bin/bash
+Provided cores: 1 (use --cores to define parallelism)
+Rules claiming more threads will be scaled down.
+Job stats:
+job      count
+-----  -------
+1            1
+total        1
+
+Select jobs to execute...
+Execute 1 jobs...
+
+[Fri May 17 23:47:24 2024]
+localrule 1:
+    output: results/sample1.stats.txt
+    jobid: 0
+    reason: Missing output files: results/sample1.stats.txt
+    resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T
+
+[Fri May 17 23:47:24 2024]
+Finished job 0.
+1 of 1 steps (100%) done
+Complete log: .snakemake/log/2024-05-17T234724.252920.snakemake.log
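A quick check of the produced file (not part of the log above) shows the line written by the echo command:

$ cat results/sample1.stats.txt
sample1 50%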
+
+
+

Basic structure of a rule

+

More commonly, rules are named and have both input and output:

+
rule generate_stats:
+    output: "results/sample1.stats.txt"
+    input: "results/sample1.bam"
+    shell:
+        """
+        samtools flagstat {input} > {output}
+        """
+
+
+

Basic structure of a rule

+

Rules are linked by their input and output files:

+
rule a:
+    output: "a.txt"
+    shell:
+        "echo 'a' > a.txt"
+rule b:
+    input: "a.txt"
+    output: "b.txt"
+    shell:
+        "cat a.txt > b.txt"
+
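Requesting b.txt as the target is then enough for Snakemake to run rule a first, because rule b's input matches rule a's output:

$ snakemake -c 1 b.txt    # runs rule a, then rule b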
+
+

Basic structure of a rule

+

You can also link rules explicitly:

+
rule a:
+    output: "a.txt"
+    shell:
+        "echo 'a' > a.txt"
+
+rule b:
+    input: rules.a.output
+    output: "b.txt"
+    shell:
+        "cat {input} > {output}"
+

but then the rule that supplies the file must be defined before the rule that uses it.

+
+
+

Wildcards

+

Wildcards generalize a workflow. Imagine you have not just sample1 but samples 1..100.

+

Instead of writing 100 rules…

+
rule generate_stats_sample1:
+    output: "results/sample1.stats.txt"
+    input: "results/sample1.bam"
+...
+rule generate_stats_sample100:
+    output: "results/sample100.stats.txt"
+    input: "results/sample100.bam"
+
+
+

Wildcards

+

…we can introduce one or more wildcards which Snakemake can match to several text strings using regular expressions.

+

In our example, we replace the actual sample ids with the wildcard sample:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    shell:
+        """
+        samtools flagstat {input} > {output}
+        """
+
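Asking for a concrete file then fixes the wildcard value; for example, with a hypothetical sample id:

$ snakemake -c 1 results/sampleA.stats.txt    # {sample} is matched to sampleA, so results/sampleA.bam is required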
+
+

Wildcards

+

Rules can have multiple wildcards…

+
rule generate_stats:
+    output: "results/{sample}_{lane}.stats.txt"
+    input: "results/{sample}_{lane}.bam"
+    shell:
+      """
+      samtools flagstat {input} > {output}
+      """
+
+
+

Wildcards

+

…but all the wildcards must be present in the output section.

+
+

Will work:

+
rule generate_stats:
+    output: "results/{sample}_{lane}.stats.txt"
+    input: "results/{sample}.bam"
+
+
+

Won’t work:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}_{lane}.bam"
+
Wildcards in input files cannot be determined from output files: 'lane'
+
+
+
+

Rule ambiguities

+

Ambiguities can arise when two rules produce the same output:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    shell:
+        """
+        samtools flagstat {input} > {output}
+        """
+
+rule print_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.log"
+    shell:
+        """
+        grep "% alignment" {input} > {output}
+        """
+      
+rule make_report:
+    output: "results/{sample}.report.pdf"
+    input: "results/{sample}.stats.txt"
+
+
+

Rule ambiguities

+
$ snakemake -c 1 -n results/sample1.report.pdf
+Building DAG of jobs...
+AmbiguousRuleException:
+Rules generate_stats and print_stats are ambiguous for the file 
+results/sample1.stats.txt.
+
+
+

Rule ambiguities

+

This can be handled in a number of ways:

+
+
    +
  • by changing the output file name of one of the rules (see the sketch after this list)
  • +
+
+
+
    +
  • or via the ruleorder directive:
  • +
+
ruleorder: generate_stats > print_stats
+
+
+
    +
  • or by specifically referring to the output of a certain rule:
  • +
+
rule make_report:
+    output: "results/{sample}.report.pdf"
+    input: rules.generate_stats.output
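For the first option, renaming the clashing output of one rule is enough; a minimal sketch (the new file name is only an example):

rule print_stats:
    output: "results/{sample}.stats_from_log.txt"
    input: "results/{sample}.log"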
+
+
+
+

Logging

+

Log files and messages add descriptions and help with debugging:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    shell:
+        """
+        samtools flagstat {input} > {output} 2>{log}
+        """
+
+
+
+
+ +
+

Tip

+
+
+

Log files are not deleted by Snakemake if there’s an error.

+
+
+
+
+
+

Resources

+

Compute resources can be set with threads and resources:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: 4
+    resources:
+        mem_mb=100
+    shell:
+        """
+        samtools flagstat --threads {threads} {input} > {output} 2>{log}
+        """
+
+
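When the workflow is run, the declared threads and memory are taken into account by the scheduler; a hypothetical invocation that caps the totals could look like:

$ snakemake -c 4 --resources mem_mb=2000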
+

Resources

+

It’s also possible to set threads based on the cores given to Snakemake (e.g. --cores 8 or -c 8).

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: workflow.cores * 0.5 
+    resources:
+        mem_mb=100
+    shell:
+        """
+        samtools flagstat --threads {threads} {input} > {output} 2>{log}
+        """
+
+
+

Resources

+

Resources can also be callables, allowing them to be set dynamically:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: workflow.cores * 0.5 
+    resources:
+        mem_mb=lambda wildcards: 1000 if wildcards.sample == "sample1-large" else 100
+    shell:
+        """
+        samtools flagstat --threads {threads} {input} > {output} 2>{log}
+        """
+
+
+

Parameters

+

Non-file rule parameters can be set with the params directive:

+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: workflow.cores * 0.5
+    resources:
+        mem_mb=100
+    params:
+        verbosity = 2
+    shell:
+        """
+        samtools flagstat --verbosity {params.verbosity} \
+          --threads {threads} {input} > {output} 2>{log}
+        """
+
+
+

Software environments

+

Software environments can be set for each rule using the conda: directive:

+
+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: workflow.cores * 0.5
+    resources:
+        mem_mb=100
+    params:
+        verbosity = 2
+    conda: "envs/samtools.yml"
+    shell:
+        """
+        samtools flagstat --verbosity {params.verbosity} \
+          --threads {threads} {input} > {output} 2>{log}
+        """
+
+
+
+

Software environments

+
+
rule generate_stats:
+    ...
+    conda: "envs/samtools.yml"
+    ...
+
+

Contents of envs/samtools.yml

+
name: samtools
+channels:
+  - bioconda
+dependencies:
+  - samtools=1.15.1
+
+

To make Snakemake use the conda environment, specify --software-deployment-method conda (or --sdm conda) on the command line. For Snakemake versions before 8.0, use --use-conda.
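A minimal example invocation, using the flags mentioned above:

$ snakemake --software-deployment-method conda -c 1
$ snakemake --use-conda -c 1    # Snakemake < 8.0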

+
+
+
+

Software environments

+

On compute clusters, you can also specify packages to load with envmodules:

+
+
rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: workflow.cores * 0.5
+    resources:
+        mem_mb=100
+    params:
+        verbosity = 2
+    conda: "envs/samtools.yml"
+    envmodules: 
+        "bioinfo-tools",
+        "samtools"
+    shell:
+        """
+        samtools flagstat --verbosity {params.verbosity} \
+          --threads {threads} {input} > {output} 2>{log}
+        """
+
+
+
+

Software environments

+

On compute clusters, you can also specify packages to load with envmodules:

+
+
rule generate_stats:
+    ...
+    envmodules: 
+        "bioinfo-tools",
+        "samtools"
+    ...
+
+

To make Snakemake use envmodules, specify --use-envmodules on the command line.
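For example:

$ snakemake --use-envmodules -c 1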

+
+
+

Config files

+

Config files allow you to configure workflows without having to change the underlying code.

+
+

Config files should be in YAML or JSON format

+
+ +
+
+
samples: ["sample1", "sample2", "sample3"]
+verbosity: 2
+
+
+
{
+  "samples": [
+    "sample1",
+    "sample2",
+    "sample3"
+  ],
+  "verbosity": 2
+}
+
+
+
+
+
+
+

Config files

+

Specify one or more config files on the command line with:

+
snakemake --configfile config.yml -j 1
+
+

or directly in a snakefile, e.g.:

+
configfile: "config.yml"
+
+
+
+

Config files

+

The config parameters are available as a dictionary inside your snakefiles and can be accessed from within rules:

+
rule all:
+    input:
+        expand("results/{sample}.stats.txt", sample = config["samples"])
+
+rule generate_stats:
+    output: "results/{sample}.stats.txt"
+    input: "results/{sample}.bam"
+    log: "results/{sample}.flagstat.log"
+    message: "Generating stats for sample {wildcards.sample}"
+    threads: workflow.cores * 0.5
+    resources:
+        mem_mb=100
+    params:
+        verbosity = config["verbosity"]
+    conda: "envs/samtools.yml"
+    envmodules: 
+        "bioinfo-tools",
+        "samtools"
+    shell:
+        """
+        samtools flagstat --verbosity {params.verbosity} \
+          --threads {threads} {input} > {output} 2>{log}
+        """
+
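Because config is an ordinary Python dictionary, it can also be inspected or modified directly in a snakefile; a small sketch assuming the config.yml shown earlier:

configfile: "config.yml"
print(config)
# {'samples': ['sample1', 'sample2', 'sample3'], 'verbosity': 2}
config["samples"].append("sample4")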
+
+

What else?

+

Snakemake is constantly being updated with new features. Check out the documentation, and specifically the section about writing rules.

+
+
+

Questions?

+
+

+ +
+
+
+
+ + + + + + + + + - - - - + - - + + + + + + + + + \ No newline at end of file diff --git a/lectures/anatomy-of-a-rule/anatomy.pdf b/lectures/anatomy-of-a-rule/anatomy.pdf deleted file mode 100644 index 2cbd064..0000000 Binary files a/lectures/anatomy-of-a-rule/anatomy.pdf and /dev/null differ diff --git a/lectures/anatomy-of-a-rule/anatomy.Rmd b/lectures/anatomy-of-a-rule/anatomy.qmd similarity index 56% rename from lectures/anatomy-of-a-rule/anatomy.Rmd rename to lectures/anatomy-of-a-rule/anatomy.qmd index 4e98900..97f74ff 100644 --- a/lectures/anatomy-of-a-rule/anatomy.Rmd +++ b/lectures/anatomy-of-a-rule/anatomy.qmd @@ -1,33 +1,30 @@ --- title: "Anatomy of a Snakefile" subtitle: "Snakemake BYOC NBIS course" -date: "`r format(Sys.time(), '%d %B, %Y')`" -output: - xaringan::moon_reader: - self-contained: true - seal: false - css: ["default", "../template.css"] - nature: - slideNumberFormat: "" +date: 2024-05-27 +format: + revealjs: + theme: + - white + - ../custom.scss + # - ../revealjs.css + embed-resources: true + toc: false + toc-depth: 1 + slide-level: 2 + slide-number: true + #preview-links: true + #chalkboard: true + # Multiple logos not possible; would need to make custom logo combining both logos + footer: Snakemake BYOC 2024 - Reproducible Research + logo: https://nbis.se/nbislogo-green.svg + smaller: true + highlight-style: gruvbox --- -layout: true - - +## Basic structure of a rule ---- - -class: center, middle - -.HUGE[Anatomy of] -
-.HUGE[a Snakefile] -
- ---- -# Basic structure of a rule - -```python +```{.python} rule: output: "results/sample1.stats.txt" shell: @@ -35,39 +32,42 @@ rule: echo -e "sample1\t50%" > {output} """ ``` --- +## Basic structure of a rule -```bash -$ snakemake -c 1 results/sample1.stats.txt +```{.bash code-line-numbers="false"} +$ snakemake -c 1 +Assuming unrestricted shared filesystem usage. Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. -Job counts: - count jobs - 1 1 - 1 +Job stats: +job count +----- ------- +1 1 +total 1 + Select jobs to execute... +Execute 1 jobs... -[Tue Sep 28 11:50:45 2021] -rule 1: +[Fri May 17 23:47:24 2024] +localrule 1: output: results/sample1.stats.txt jobid: 0 + reason: Missing output files: results/sample1.stats.txt + resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T -[Tue Sep 28 11:50:45 2021] +[Fri May 17 23:47:24 2024] Finished job 0. 1 of 1 steps (100%) done - -$ cat results/sample1.stats.txt -sample1 50% +Complete log: .snakemake/log/2024-05-17T234724.252920.snakemake.log ``` ---- -# Basic structure of a rule +## Basic structure of a rule More commonly, rules are named and have both input and output: -```python +```{.python} rule generate_stats: output: "results/sample1.stats.txt" input: "results/sample1.bam" @@ -76,17 +76,49 @@ rule generate_stats: samtools flagstat {input} > {output} """ ``` ---- -# Wildcards +## Basic structure of a rule + +Rules are linked by their input and output files: + +```{.python} +rule a: + output: "a.txt" + shell: + "echo 'a' > a.txt" +rule b: + input: "a.txt" + output: "b.txt" + shell: + "cat a.txt > b.txt" +``` + +## Basic structure of a rule + +You can also link rules explicitly: + +```{.python code-line-numbers="2,7"} +rule a: + output: "a.txt" + shell: + "echo 'a' > a.txt" + +rule b: + input: rules.a.output + output: "b.txt" + shell: + "cat {input} > {output}" +``` + +but then the rule that supplies the file must be define before the rule that uses it. -.green[Wildcards] generalize a workflow. Imagine you have not just sample1 but samples 1..100. +## Wildcards {auto-animate=true} --- +Wildcards generalize a workflow. Imagine you have not just sample1 but samples 1..100. Instead of writing 100 rules... -```python +```{.python code-line-numbers="false"} rule generate_stats_sample1: output: "results/sample1.stats.txt" input: "results/sample1.bam" @@ -96,9 +128,9 @@ rule generate_stats_sample100: input: "results/sample100.bam" ``` --- +## Wildcards {auto-animate=true} -...we can introduce one or more .green[wildcards] which Snakemake can match to several text strings using regular expressions. +...we can introduce one or more `wildcards` which Snakemake can match to several text strings using regular expressions. In our example, we replace the actual sample ids with the wildcard `sample`: @@ -112,13 +144,12 @@ rule generate_stats: samtools flagstat {input} > {output} """ ``` ---- -# Wildcards +## Wildcards Rules can have multiple wildcards... -```python +```{.python} rule generate_stats: output: "results/{sample}_{lane}.stats.txt" input: "results/{sample}_{lane}.bam" @@ -128,35 +159,37 @@ rule generate_stats: """ ``` --- +## Wildcards -...but wildcards **must** be present in the output section. +...but all the wildcards **must** be present in the output section. 
+::: {.fragment} **Will work:** -```python +```{.python} rule generate_stats: output: "results/{sample}_{lane}.stats.txt" input: "results/{sample}.bam" ``` +::: --- - -**Won't work:** -```python +::: {.fragment} +Won't work. +```{.python} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}_{lane}.bam" ``` -```bash + +```{.bash code-line-numbers="false"} Wildcards in input files cannot be determined from output files: 'lane' ``` +::: ---- -# Rule ambiguities +## Rule ambiguities {auto-animate=true} Ambiguities can arise when two rules produce the same output: -```python +```{.python} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" @@ -177,44 +210,46 @@ rule make_report: output: "results/{sample}.report.pdf" input: "results/{sample}.stats.txt" ``` --- -```bash + +## Rule ambiguities {auto-animate=true} + +```{.bash code-line-numbers="false"} $ snakemake -c 1 -n results/sample1.report.pdf Building DAG of jobs... AmbiguousRuleException: -Rules generate_stats and print_stats are ambiguous for the file results/sample1.stats.txt. +Rules generate_stats and print_stats are ambiguous for the file +results/sample1.stats.txt. ``` ---- -# Rule ambiguities - -Ambiguities can arise when two rules produce the same output: +## Rule ambiguities This can be handled in a number of ways: +::: {.fragment} - by changing the output file name of one of the rules +::: --- - +::: {.fragment} - or via the `ruleorder` directive: -```python +```{.python} ruleorder: generate_stats > print_stats ``` +::: --- - +::: {.fragment} - or by specifically referring to the output of a certain rule: -```python +```{.python} rule make_report: output: "results/{sample}.report.pdf" input: rules.generate_stats.output ``` +::: ---- -# Logging +## Logging Logfiles and messages add descriptions and help with debugging: -```python + +```{.python} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" @@ -225,15 +260,17 @@ rule generate_stats: samtools flagstat {input} > {output} 2>{log} """ ``` +:::{.callout-tip} Log files are not deleted by snakemake if there's an error. ---- -# Resources +::: -Compute resources can be set with .green[threads] and .green[resources]: +## Resources {auto-animate=true} -```python +Compute resources can be set with **threads** and **resources**: + +```{.python} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" @@ -247,18 +284,18 @@ rule generate_stats: samtools flagstat --threads {threads} {input} > {output} 2>{log} """ ``` ---- -# Resources -Compute resources can be set with .green[threads] and .green[resources]: +## Resources {auto-animate=true} -```python +It's also possible to set threads based on the cores given to snakemake (_e.g._ `--cores 8` or `-c 8`). + +```{.python code-line-numbers="6"} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" log: "results/{sample}.flagstat.log" message: "Generating stats for sample {wildcards.sample}" - threads: workflow.cores * 0.5 # <---- threads as a function of workflow cores + threads: workflow.cores * 0.5 resources: mem_mb=100 shell: @@ -267,17 +304,30 @@ rule generate_stats: """ ``` -It's also possible to set threads based on the cores given to snakemake (_e.g._ `--cores 8` or `-c 8`). +## Resources {auto-animate=true} --- +Resources can also be callables, allowing them to be set dynamically: -More on resources in a lecture tomorrow. 
+```{.python code-line-numbers="8"} +rule generate_stats: + output: "results/{sample}.stats.txt" + input: "results/{sample}.bam" + log: "results/{sample}.flagstat.log" + message: "Generating stats for sample {wildcards.sample}" + threads: workflow.cores * 0.5 + resources: + mem_mb=lambda wildcards: 1000 if wildcards.sample == "sample1-large" else 100 + shell: + """ + samtools flagstat --threads {threads} {input} > {output} 2>{log} + """ +``` ---- -# Parameters +## Parameters {auto-animate=true auto-animate-restart=true} -Rule parameters can be set with the .green[params] directive: -```python +Non-file rule parameters can be set with the **params** directive: + +```{.python} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" @@ -294,12 +344,13 @@ rule generate_stats: --threads {threads} {input} > {output} 2>{log} """ ``` ---- -# Software environments -Software environments can be set for each rule using `conda:`: +## Software environments {auto-animate=true} -```python +Software environments can be set for each rule using the `conda:` directive: + +::: {data-id="code1"} +```{.python code-line-numbers="11"} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" @@ -317,9 +368,22 @@ rule generate_stats: --threads {threads} {input} > {output} 2>{log} """ ``` +::: + +## Software environments {auto-animate=true auto-animate-easing=None} + +::: {data-id="code1"} +```{.python code-line-numbers="3"} +rule generate_stats: + ... + conda: "envs/samtools.yml" + ... +``` +::: + +Contents of `envs/samtools.yml` ```yaml -### Contents of envs/samtools.yml ### name: samtools channels: - bioconda @@ -327,14 +391,16 @@ dependencies: - samtools=1.15.1 ``` -To make Snakemake use the conda environment, specify `--use-conda` on the command line. +:::{.fragment} +To make Snakemake use the conda environment, specify `--software-deployment-method conda` (or `--sdm conda`) on the command line. For Snakemake versions before 8.0, use `--use-conda`. +::: ---- -# Software environments +## Software environments {auto-animate=true auto-animate-restart=true} -Or by using `envmodules`, _e.g._ in compute clusters: +On compute clusters, you can also specify packages to load with `envmodules:` -```python +::: {data-id="code2"} +```{.python code-line-numbers="12-14"} rule generate_stats: output: "results/{sample}.stats.txt" input: "results/{sample}.bam" @@ -355,30 +421,45 @@ rule generate_stats: --threads {threads} {input} > {output} 2>{log} """ ``` +::: + +## Software environments {auto-animate=true auto-animate-easing=None auto-animate-delay=0} + +On compute clusters, you can also specify packages to load with `envmodules:` + +::: {data-id="code2"} +```{.python code-line-numbers="3-5"} +rule generate_stats: + ... + envmodules: + "bioinfo-tools", + "samtools" + ... +``` +::: To make Snakemake use envmodules, specify `--use-envmodules` on the command line. --- +## Config files -More on conda and envmodules tomorrow. +**Config files** allow you to configure workflows without having to change the underlying code. ---- -# Config files +:::{.fragment} -.green[Config files] allow you to configure workflows without having to change the underlying code. 
+Config files should be in `YAML` or `JSON` format --- +::: {.panel-tabset} -.small[Config files should be in `YAML` or `JSON` format:] +### YAML -```yaml -### Contents of config.yml ### +```{.yaml code-line-numbers="false"} samples: ["sample1", "sample2", "sample3"] verbosity: 2 ``` -```json -### Contents of config.json ### +### JSON + +```{.json code-line-numbers="false"} { "samples": [ "sample1", @@ -388,59 +469,32 @@ verbosity: 2 "verbosity": 2 } ``` +::: +::: --- +## Config files -.small[Specify one or more config files on the command line with:] -```bash +Specify one or more config files on the command line with: + +```{.bash code-line-numbers="false"} snakemake --configfile config.yml -j 1 ``` --- - -.small[Or directly in a snakefile, _e.g._:] -```python +:::{.fragment} +or directly in a snakefile, _e.g._: +```{.python code-line-numbers="false"} configfile: "config.yml"" ``` +::: ---- -# Config files - -The config parameters are available as a dictionary inside your snakefiles: - -```yaml -### Contents of config.yml ### -samples: ["sample1", "sample2", "sample3"] -verbosity: 2 -``` - -```python -configfile: "config.yml" -print(config) -{'samples': ['sample1', 'sample2', 'sample3'], 'verbosity': 2} -``` - --- +## Config files -This allows you to manipulate the config variable inside your snakemake files and python code. - -```python -config["samples"].append("sample4") -print(config) -{'samples': ['sample1', 'sample2', 'sample3', 'sample4'], 'verbosity': 2} -``` - ---- -# Config files - -In our example rule, verbosity level can be controlled via a config file like this: - -```python -configfile: "config.yml" +The config parameters are available as a dictionary inside your snakefiles and can be accessed from within rules: +```{.python code-line-numbers="3,14"} rule all: - input: - expand("results/{sample}.stats.txt", sample = config["samples"]) + input: + expand("results/{sample}.stats.txt", sample = config["samples"]) rule generate_stats: output: "results/{sample}.stats.txt" @@ -451,7 +505,7 @@ rule generate_stats: resources: mem_mb=100 params: - verbosity = config["verbosity"] # <----- + verbosity = config["verbosity"] conda: "envs/samtools.yml" envmodules: "bioinfo-tools", @@ -462,87 +516,9 @@ rule generate_stats: --threads {threads} {input} > {output} 2>{log} """ ``` ---- -# Config files - -Config files are also convenient for defining what the workflow will do. - --- - -If no targets are given on the command line, Snakemake will run the first rule specified. - -By convention this rule is named `all` and is used as a 'pseudo-rule' to define what the workflow will generate. - -```python -rule all: - input: - "results/sample1.stats.txt", - "results/sample2.stats.txt", - "results/sample3.stats.txt" - -rule generate_stats: - output: "results/{sample}.stats.txt" - input: "results/{sample}.bam" - log: "results/{sample}.flagstat.log" -... -``` ---- -# Config files - -Config files are also convenient for defining what the workflow will do. - -If no targets are given on the command line, Snakemake will run the first rule specified. - -By convention this rule is named `all` and is used as a 'pseudo-rule' to define what the workflow will generate. - -```python -samples = ["sample1", "sample2", "sample3"] -rule all: - input: - expand("results/{sample}.stats.txt", sample = config["samples"]) - -rule generate_stats: - output: "results/{sample}.stats.txt" - input: "results/{sample}.bam" - log: "results/{sample}.flagstat.log" -... 
-``` - -If we define a list of samples we can condense the input section of the `all` rule using the .green[expand] function. +## What else? ---- -# Config files - -By defining samples in the config file (either directly or via a sample list that is read and stored in the config dictionary), your workflow becomes way more flexible. - -```yaml -### Contents of config.yml ### -samples: ["sample1", "sample2", "sample3"] -verbosity: 2 -``` - -```python -configfile: "config.yml" -rule all: - input: - expand("results/{sample}.stats.txt", sample = config["samples"]) - -rule generate_stats: - output: "results/{sample}.stats.txt" - input: "results/{sample}.bam" - log: "results/{sample}.flagstat.log" -... -``` - ---- - -# What else? - -Snakemake is constantly being updated with new features. Check out the documentation (https://snakemake.readthedocs.io/), and specifically the section about [writing rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html). - - ---- +Snakemake is constantly being updated with new features. Check out the [documentation](https://snakemake.readthedocs.io/), and specifically the section about [writing rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html). -class: center, middle -# Questions? \ No newline at end of file +## Questions? \ No newline at end of file diff --git a/lectures/reproducibility-tools/Makefile b/lectures/reproducibility-tools/Makefile index 55cee41..202ea79 100644 --- a/lectures/reproducibility-tools/Makefile +++ b/lectures/reproducibility-tools/Makefile @@ -1,7 +1,7 @@ all: reproducibility-tools.html -%.html: %.Rmd - Rscript -e 'rmarkdown::render("$<")' +%.html: %.qmd + quarto render $< # OPENSSL_CONF due to https://github.com/nodejs/node/issues/43132#issuecomment-1130503287 %.pdf: %.html diff --git a/lectures/reproducibility-tools/reproducibility-tools.Rmd b/lectures/reproducibility-tools/reproducibility-tools.Rmd deleted file mode 100644 index ba0669b..0000000 --- a/lectures/reproducibility-tools/reproducibility-tools.Rmd +++ /dev/null @@ -1,455 +0,0 @@ ---- -title: "Reproducible Research and Snakemake" -subtitle: "Snakemake BYOC NBIS course" -date: "`r format(Sys.time(), '%d %B, %Y')`" -output: - xaringan::moon_reader: - self-contained: true - seal: false - css: ["default", "../template.css"] - nature: - slideNumberFormat: "" ---- - -layout: true - - - ---- - -class: center, middle - -.HUGE[Reproducible Research and Snakemake] - -```{r Setup, echo = FALSE, message = FALSE} -# Knitr setup -knitr::opts_chunk$set(message = FALSE, - warning = FALSE) - -# Load packages -library("dplyr") -library("kableExtra") -``` - ---- - -# Reproducibility - -- Reproducible research is about being able to replicate the results of a study -- It is an important aspect of the scientific method -- **Computational reproducibility** is one part of it -- Ideally, given the **same data** and the **same code**, there are identical outcomes - --- - -*Code* encompasses -- The workflow itself (→ `Snakefile`) -- The helper scripts you are calling (→ `scripts/`) -- The 3rd-party tools you are running/the execution environment (→ this lecture) - - ---- - -# Computational reproducibility - -Why the effort? - -.tiny[M. Schwab et al. *Making scientific computations reproducible*. https://dx.doi.org/10.1109/5992.881708] - -> Because many researchers typically forget details -> of their own work, they are not unlike strangers -> when returning to projects after time away. 
-> Thus, efforts to communicate your work to -> strangers can actually help you communicate -> with yourself over time. - --- - -→ **You** are part of the target audience - - ---- - -# Don’t be *that* person - -*Science* implemented a replication policy in 2011. -A study in 2018 requested raw data and code in accordance with the policy. -Some answers: - --- -> When you approach a PI for the source codes and raw data, you better explain who you are, -> whom you work for, why you need the data and what you are going to do with it. - -  - --- - -> I have to say that this is a very unusual request without any explanation! -> Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation. - --- - -(26% out of 204 randomly selected papers in the journal could be reproduced.) - -.tiny[Stodden et. al (2018). *An empirical analysis of journal policy effectiveness for computational reproducibility* https://doi.org/10.1073/pnas.1708290115] - ---- - -# Combine tools to make research reproducible - -.center[] - --- - -* Track code changes over time with .green[Git] and share it on [GitHub](https://github.com) (not this talk) - --- - -* Make your workflow reproducible with a workflow manager (.green[Snakemake], .green[Nextflow], .green[WDL]) - --- - -* Make the execution environment reproducible with .green[Conda] environments and/or .green[containers] - - ---- - -# Conda: a .green[package], .green[dependency], and .green[environment] manager - -* Conda installs packages -* Packages come from a central repository at https://anaconda.org/ -* Users can contribute their own packages via *channels* -* Highly recommended: The [Bioconda](https://bioconda.github.io/) channel - ---- - -# Using Conda - -* Install Conda, for example with [Miniconda](https://docs.conda.io/en/latest/miniconda.html) - -* Set up the [Bioconda](https://bioconda.github.io/) channel - --- - -* Install Samtools and BWA into a new **Conda environment** named `mapping`: -```{bash, eval=FALSE} -$ conda create -n mapping samtools bwa -``` - --- - -* Conda also installs all .green[dependencies] – other software required by Samtools and/or BWA. - --- - -To use the tools in the environment, .green[activate] it: -```{bash, eval=FALSE} -$ conda activate mapping -$ samtools --version -samtools 1.15.1 -``` - --- -* Install a tool into an existing environment: -```{bash, eval=FALSE} -conda install -n mapping bowtie2 -``` -(Leaving out `-n mapping` installs into the currently active environment.) - ---- - -# Conda environments - -* You can have as many environments as you wish - --- - -* Environments are independent - --- - -* If something is broken, simply delete the environment and start over - --- - -```{bash, eval=FALSE} -$ conda env remove -n mapping -``` - --- - -* To test a new tool, install it into a fresh Conda environment. Delete the environment to uninstall. - --- - -* Find packages by searching [anaconda.org](https://anaconda.org) or with `conda search` - - ---- - -# Conda environment files - -* Conda environments can be created from .green[environment files] in YAML format. 
- --- - -* Example `bwa.yaml`: - -```{yaml conda env one, eval = FALSE} -channels: - - conda-forge - - bioconda - - defaults -dependencies: - - bwa=0.7.17 -``` - --- -* Create the environment: -```{bash, eval = FALSE} -$ conda env create -n bwa -f bwa.yaml -``` - ---- - -# Snakemake + Conda - -## Option one: A single environment for the entire workflow - -* Write an environment file (`environment.yaml`) that includes .green[all tools used by the workflow]: -```{python conda env big, eval=FALSE} -name: best-practice-smk -channels: - - conda-forge - - bioconda - - default -dependencies: - - snakemake=6.8.0 # ← Snakemake is part of the environment -... - - multiqc=1.11 # ← Version numbers for reproducibility - - samtools=1.13 -``` - --- -* Create the environment, activate it and run the workflow within it: -```{bash snakemake conda env, eval=FALSE} -$ conda env create -f environment.yml -$ conda activate best-practice-smk -$ snakemake -``` - --- -* Possibly helpful: `conda export -n envname > environment.yaml` - -.tiny[source: [best practice example](https://github.com/NBISweden/snakemake_best_practice)] - ---- -# Snakemake + Conda - -## Option two: Rule-specific environments - -You can let Snakemake create and activate Conda environments for you. - --- -1. Create the environment file, such as `envs/bwa.yaml` (`envs/` is best practice) - --- -1. Add the `conda:` directive to the rule: -```{python conda rule, eval = FALSE} -rule create_bwa_index: - output: ... - input: ... - conda: "envs/bwa.yaml" # ← Path to environment YAML file - shell: - "bwa index {input}" -``` --- -1. Run `snakemake --use-conda` - --- - -* Snakemake creates the environment for you and re-uses it next time -* If the YAML file changes, the environment is re-created -* `conda:` does not work if you use `run:` (instead of `shell:` or `script:`) - - -.tiny[modified from: [best practice example](https://github.com/NBISweden/snakemake_best_practice)] - - ---- - -# Using a "module" system - -* Conda environments can be large and slow to create - -* Some cluster operators frown upon using it - --- - -* UPPMAX and other clusters have a .green[module] command for getting access to software: -``` -$ module load bioinfo-tools bwa -``` - --- - -* Snakemake supports this with the `envmodules:` directive: -```{bash, eval = FALSE} -rule create_bwa_index: - output: ... - input: ... - envmodules: - "bioinfo-tools", - "bwa", - conda: "envs/bwa.yaml" # ← Fallback - shell: - "bwa index {input}" -``` - -* Run with `snakemake --use-envmodules` - -* For reproducibility, [the Snakemake documentation recommends](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules) to also include a `conda:` section - ---- - -# Containers - -* Containers represent another way of packaging applications - --- - -* Each container contains the application itself and .green[all system-level dependencies and libraries] (that is, a functional Linux installation) - --- - -* It is fully .green[isolated] from the other software on the machine: - By default, the tools in the container can only access what is in the container. 
- --- - -* The most common software for managing containers is .green[Docker] - ---- - -# Containers - -## Docker nomenclature - --- -* A Docker .green[image] is a standalone executable package of software (on disk) - --- -* A .green[Dockerfile] is a recipe used to build a Docker .green[image] - --- -* A Docker .green[container] is a standard unit of software run on the Docker Engine - (running an image gives a container) - --- -* .green[DockerHub] is an online service for sharing Docker images - --- - -## Docker vs Singularity - -* On high-performance clusters (HPC), Docker is often not installed due to security concerns. - .green[Singularity] is often available as an alternative. - --- -* Docker images can be converted into Singularity images - --- -* → Singularity can be used to run Docker containers - ---- - -# Running Snakemake jobs in containers - -Snakemake can run a jobs in a container using Singularity - -* Ensure your system has Singularity installed - --- - -* Find a Docker or Singularity image with the tool to run (https://biocontainers.pro/ or [DockerHub](https://hub.docker.com/)) - --- - -* Add the `container:` directive to your rule: - -```{python singularity rule, eval = FALSE} -rule minimap2_version: - container: "docker://quay.io/biocontainers/minimap2:2.24--h5bf99c6_0" # ← "docker://" is needed - shell: - "minimap2 --version" -``` - --- - -* Start your workflow on the command line with `--use-singularity` - -```{bash snakemake use singularity, eval=FALSE} -$ snakemake --use-singularity -j 1 -... -Pulling singularity image docker://quay.io/biocontainers/minimap2:2.24--h5bf99c6_0. -... -Activating singularity image .../.snakemake/singularity/342e6ddbac7e5929a11e6ae9350454c0.simg -INFO: Converting SIF file to temporary sandbox... -2.24-r1122 -INFO: Cleaning up image... -... -``` - ---- - -# Containers – advanced topics - -* A [Docker image to use for *all* rules can be specified](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers) - --- -* You can package your entire workflow into a Docker image by writing a .green[Dockerfile]. - [See this example](https://github.com/NBISweden/workshop-reproducible-research/blob/0ee1eca78ccefbd06fbeb2c0aba37030230df90d/tutorials/containers/Dockerfile) - - Snakemake runs *inside* the container. - - To run the workflow, only Docker or Singularity is needed - --- -* [Conda and containers can be combined]([Snakemake documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#ad-hoc-combination-of-conda-package-management-with-containers): Specify a global container, run with `--use-conda --use-singularity`, and Snakemake creates the Conda environment within the container. - --- -* [Snakemake can automatically generate a Dockerfile](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#containerization-of-conda-based-workflows) - that contains all Conda environments used by the rules of the workflow using the flag - `--containerize`. 
- ---- - -# Summary - -There are many ways to use other .green[tools for reproducible research] together with Snakemake: - --- - -* Use .green[Git] for version control, backup and share your code - --- - -* Run rules or your entire workflow in .green[Conda] environments - --- - -* Run your rules in isolated Docker/Singularity .green[containers] - --- - -* Package your entire workflow in a .green[Docker container] - - - diff --git a/lectures/reproducibility-tools/reproducibility-tools.qmd b/lectures/reproducibility-tools/reproducibility-tools.qmd new file mode 100644 index 0000000..811c935 --- /dev/null +++ b/lectures/reproducibility-tools/reproducibility-tools.qmd @@ -0,0 +1,406 @@ +--- +title: "Reproducible Research and Snakemake" +subtitle: "Snakemake BYOC NBIS course" +date: 2024-05-28 +format: + revealjs: + theme: + - white + - ../custom.scss + self-contained: false + toc: false + toc-depth: 1 + slide-level: 2 + slide-number: true + #preview-links: true + #chalkboard: true + # Multiple logos not possible; would need to make custom logo combining both logos + footer: Snakemake BYOC 2024 - Reproducible Research + logo: https://nbis.se/nbislogo-green.svg + smaller: true + highlight-style: gruvbox + fig-height: 3 + fig-width: 3 + code-line-numbers: false +execute: + echo: true + #warning: false + #cache: false + #include: true + #autodep: true + eval: false + #error: true + +--- + +## Reproducibility + +- Reproducible research is about being able to replicate the results of a study +- It is an important aspect of the scientific method +- **Computational reproducibility** is one part of it +- Ideally, given the **same data** and the **same code**, there are identical outcomes + +. . . + +### Code encompasses + +- The workflow itself (→ `Snakefile`) +- The helper scripts you are calling (→ `scripts/`) +- The 3rd-party tools you are running/the execution environment (→ this lecture) + + +## Computational reproducibility + +Why the effort? + +> Because many researchers typically forget details +> of their own work, they are not unlike strangers +> when returning to projects after time away. +> Thus, efforts to communicate your work to +> strangers can actually help you communicate +> with yourself over time. + +M. Schwab et al. +*Making scientific computations reproducible*.\ + + +. . . + +→ **You** are part of the target audience + + +## Don’t be *that* person + +* *Science* implemented a replication policy in 2011. +* A study in 2018 (Stodden et. al, ) +requested raw data and code in accordance with the policy. +* Some answers: + +. . . + +> When you approach a PI for the source codes and raw data, you better explain who you are, +> whom you work for, why you need the data and what you are going to do with it. + +. . . + +> I have to say that this is a very unusual request without any explanation! +> Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation. + +. . . + +26% out of 204 randomly selected papers in the journal could be reproduced. 
+ + +## Combining tools to make research reproducible + +![](reproducibility-overview.png){fig-align="center"} + +::: {.incremental} + +* Track code changes over time with **Git** and share it on [GitHub](https://github.com) (not this talk) +* Make your workflow reproducible with a workflow manager (**Snakemake**, **Nextflow**, **WDL**) +* Make the execution environment reproducible with **Conda** environments and/or **containers** + +::: + +## Conda: a **package**, **dependency**, and **environment** manager + +* Conda installs packages +* Packages come from a central repository at +* Users can contribute their own packages via *channels* +* Highly recommended: The [Bioconda](https://bioconda.github.io/) channel + + +## Using Conda + +::: {.incremental} + +* Install Conda (through [Miniconda](https://docs.anaconda.com/free/miniconda/)) +* Set up the [Bioconda](https://bioconda.github.io/) channel +* Install Samtools and BWA into a new **Conda environment** named `mapping`: + + ```bash + $ conda create -n mapping samtools bwa + ``` + +* Conda also installs all **dependencies** – other software required by Samtools and/or BWA. +To use the tools in the environment, **activate** it: + + ```bash + $ conda activate mapping + $ samtools --version + samtools 1.15.1 + ``` + +* Install a tool into an existing environment: + + ```bash + conda install -n mapping bowtie2 + ``` + (Leaving out `-n mapping` installs into the currently active environment.) + +::: + + +## Conda environments + +::: {.incremental} + +* You can have as many environments as you wish +* Environments are independent +* If something is broken, simply delete the environment and start over + + ```bash + $ conda env remove -n mapping + ``` + +* To test a new tool, install it into a fresh Conda environment. Delete the environment to uninstall. + +* Find packages by searching [anaconda.org](https://anaconda.org) or with `conda search` + +::: + + +## Conda environment files + +::: {.incremental} + +* Conda environments can be created from **environment files** in YAML format. +* Example `bwa.yaml`: + + ```yaml + channels: + - conda-forge + - bioconda + - defaults + dependencies: + - bwa=0.7.17 + ``` + +* Create the environment: + + ```bash + $ conda env create -n bwa -f bwa.yaml + ``` + +::: + + +## Snakemake + Conda + +### Option one: A single environment for the entire workflow + +* Write an environment file (`environment.yaml`) that includes **all tools used by the workflow**: + + ```yaml + name: best-practice-smk + channels: + - conda-forge + - bioconda + - default + dependencies: + - snakemake=6.8.0 # ← Snakemake is part of the environment + ... + - multiqc=1.11 # ← Version numbers for reproducibility + - samtools=1.13 + ``` + +. . . + +* Create the environment, activate it and run the workflow within it: + + ```bash + $ conda env create -f environment.yml + $ conda activate best-practice-smk + $ snakemake + ``` + +. . . + +* Possibly helpful: `conda export -n envname > environment.yaml` + +source: [best practice example](https://github.com/NBISweden/snakemake_best_practice) + + +## Snakemake + Conda + +### Option two: Rule-specific environments + +You can let Snakemake create and activate Conda environments for you. + +::: {.incremental} + +* Create the environment file, such as `envs/bwa.yaml` (`envs/` is best practice) +* Add the `conda:` directive to the rule: + + ```{python conda rule, eval = FALSE} + rule create_bwa_index: + output: ... + input: ... 
+ conda: "envs/bwa.yaml" # ← Path to environment YAML file + shell: + "bwa index {input}" + ``` +* Run `snakemake --use-conda` +* Snakemake creates the environment for you and re-uses it next time +* If the YAML file changes, the environment is re-created +* `conda:` does not work if you use `run:` (instead of `shell:` or `script:`) + +::: + +modified from: [best practice example](https://github.com/NBISweden/snakemake_best_practice) + + +## Using a "module" system + +* Conda environments can be large and slow to create +* Some cluster operators frown upon using it + +. . . + +* UPPMAX, dardel and other clusters have a **module** command for getting access to software: + + ``` + $ module load bioinfo-tools bwa + ``` + +. . . + +* Snakemake supports this with the `envmodules:` directive: + + ```bash + rule create_bwa_index: + output: ... + input: ... + envmodules: + "bioinfo-tools", + "bwa", + conda: "envs/bwa.yaml" # ← Fallback + shell: + "bwa index {input}" + ``` + +* Run with `snakemake --use-envmodules` + +* For reproducibility, [the Snakemake documentation recommends](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules) to also include a `conda:` section + + +## Containers + +::: {.incremental} + +* Containers represent another way of packaging applications +* Each container contains the application itself and **all system-level dependencies and libraries** (that is, a functional Linux installation) +* It is fully **isolated** from the other software on the machine: + By default, the tools in the container can only access what is in the container. +* The most common software for managing containers is **Docker** + +::: + + +## Containers + +### Docker nomenclature + +::: {.incremental} + +* A Docker **image** is a standalone executable package of software (on disk) +* A **Dockerfile** is a recipe used to build a Docker **image** +* A Docker **container** is a standard unit of software run on the Docker Engine + (running an image gives a container) +* **DockerHub** is an online service for sharing Docker images + +::: + +### Docker vs Singularity + +::: {.incremental} + +* On high-performance clusters (HPC), Docker is often not installed due to security concerns. + **Singularity** is often available as an alternative. +* Docker images can be converted into Singularity images +* → Singularity can be used to run Docker containers + +::: + + +## Running Snakemake jobs in containers + +Snakemake can run a jobs in a container using Singularity + +* Ensure your system has Singularity installed + +. . . + +* Find a Docker or Singularity image with the tool to run ( or [DockerHub](https://hub.docker.com/)) + +. . . + +* Add the `container:` directive to your rule: + + ```python + rule minimap2_version: + container: "docker://quay.io/biocontainers/minimap2:2.24--h5bf99c6_0" # ← "docker://" is needed + shell: + "minimap2 --version" + ``` + +. . . + +* Start your workflow on the command line with `--use-singularity` + + ```bash + $ snakemake --use-singularity -j 1 + ... + Pulling singularity image docker://quay.io/biocontainers/minimap2:2.24--h5bf99c6_0. + ... + Activating singularity image .../.snakemake/singularity/342e6ddbac7e5929a11e6ae9350454c0.simg + INFO: Converting SIF file to temporary sandbox... + 2.24-r1122 + INFO: Cleaning up image... + ... 
+ ``` + + +## Containers – advanced topics + +::: {.incremental} + +* A [Docker image to use for *all* rules can be specified](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers) + +* You can package your entire workflow into a Docker image by writing a **Dockerfile**. + [See this example](https://github.com/NBISweden/workshop-reproducible-research/blob/0ee1eca78ccefbd06fbeb2c0aba37030230df90d/tutorials/containers/Dockerfile) + - Snakemake runs *inside* the container. + - To run the workflow, only Docker or Singularity is needed + +* [Conda and containers can be combined]([Snakemake documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#ad-hoc-combination-of-conda-package-management-with-containers)): Specify a global container, run with `--use-conda --use-singularity`, and Snakemake creates the Conda environment within the container. + +* [Snakemake can automatically generate a Dockerfile](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#containerization-of-conda-based-workflows) + that contains all Conda environments used by the rules of the workflow using the flag + `--containerize`. + +::: + +## Summary + +There are many ways to use other **tools for reproducible research** together with Snakemake: + +::: {.incremental} + +* Use **Git** for version control, backup and share your code +* Run rules or your entire workflow in **Conda** environments +* Run your rules in isolated Docker/Singularity **containers** +* Package your entire workflow in a **Docker container** + +::: + + diff --git a/lectures/scatter-gather/Snakefile b/lectures/scatter-gather/Snakefile index c1b6ec4..b7d871e 100644 --- a/lectures/scatter-gather/Snakefile +++ b/lectures/scatter-gather/Snakefile @@ -3,11 +3,11 @@ import os samples = ["sample1", "sample2"] splits = 5 -scatteritems = range(1, splits+1) +scatteritems = [f"{split:03d}" for split in list(range(1, splits+1))] wildcard_constraints: - scatteritems = "\d+", - sample = "[\w\d\-\.]+" + scatteritems = "\\d+", + sample = "\\w+" rule all: input: @@ -23,24 +23,25 @@ rule scatter: conda: "envs/seqkit.yml" params: - parts = splits, + splits = splits, outdir = lambda wildcards, output: os.path.dirname(output[0]) shell: """ - seqkit split -p {params.parts} -O {params.outdir} {input} > {log} 2>&1 - rename 's/part_0*//' {params.outdir}/{wildcards.sample}.*.fastq + seqkit split --by-part-prefix {wildcards.sample}. 
-p {params.splits} -O {params.outdir} {input} > {log} 2>&1 """ -rule reversecomplement: +rule rc: output: "rc/{sample}/{sample}.{scatteritem}.rc.fastq" input: "splits/{sample}/{sample}.{scatteritem}.fastq" + log: + "logs/{sample}.{scatteritem}.rc.log" conda: "envs/seqkit.yml" shell: """ - seqkit seq --reverse --complement {input} > {output} + seqkit seq --seq-type DNA --reverse --complement {input} > {output} 2> {log} """ rule gather: @@ -49,4 +50,6 @@ rule gather: input: expand("rc/{{sample}}/{{sample}}.{scatteritem}.rc.fastq", scatteritem = scatteritems) shell: - "cat {input} > {output}" \ No newline at end of file + """ + cat {input} > {output} + """ diff --git a/lectures/scatter-gather/Snakefile_checkpoints b/lectures/scatter-gather/Snakefile_checkpoints new file mode 100644 index 0000000..6943f2b --- /dev/null +++ b/lectures/scatter-gather/Snakefile_checkpoints @@ -0,0 +1,61 @@ +import os +import random + +samples = ["sample1", "sample2"] + +wildcard_constraints: + scatteritems = "\\d+", + sample = "\\w+" + + +rule all: + input: + expand("{sample}.rc.fastq", sample = samples) + +checkpoint scatter: + output: + directory("splits/{sample}") + input: + "data/{sample}.fastq" + log: + "logs/{sample}.scatter.log" + conda: + "envs/seqkit.yml" + params: + splits = random.randint(1,10) + shell: + """ + seqkit split --by-part-prefix {wildcards.sample}. -p {params.splits} -O {output} {input} > {log} 2>&1 + """ + +rule rc: + output: + "rc/{sample}/{sample}.{scatteritem}.rc.fastq" + input: + "splits/{sample}/{sample}.{scatteritem}.fastq" + log: + "logs/{sample}.{scatteritem}.rc.log" + conda: + "envs/seqkit.yml" + shell: + """ + seqkit seq --seq-type DNA --reverse --complement {input} > {output} 2> {log} + """ + +def aggregate_input(wildcards): + checkpoint_output = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = glob_wildcards(os.path.join(checkpoint_output,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input + +rule gather: + output: + "{sample}.rc.fastq" + input: + aggregate_input + shell: + """ + cat {input} > {output} + """ diff --git a/lectures/scatter-gather/dag_scatter.png b/lectures/scatter-gather/dag_scatter.png deleted file mode 100644 index 13cb16f..0000000 Binary files a/lectures/scatter-gather/dag_scatter.png and /dev/null differ diff --git a/lectures/scatter-gather/envs/seqkit.yml b/lectures/scatter-gather/envs/seqkit.yml index 374bb8b..935d978 100644 --- a/lectures/scatter-gather/envs/seqkit.yml +++ b/lectures/scatter-gather/envs/seqkit.yml @@ -3,6 +3,6 @@ channels: - bioconda - conda-forge - defaults + - nanoporetech dependencies: - - seqkit - - rename \ No newline at end of file + - seqkit \ No newline at end of file diff --git a/lectures/scatter-gather/filegraph.png b/lectures/scatter-gather/filegraph.png deleted file mode 100644 index f3cea06..0000000 Binary files a/lectures/scatter-gather/filegraph.png and /dev/null differ diff --git a/lectures/scatter-gather/scatter-gather.Rmd b/lectures/scatter-gather/scatter-gather.Rmd deleted file mode 100644 index cdb8042..0000000 --- a/lectures/scatter-gather/scatter-gather.Rmd +++ /dev/null @@ -1,329 +0,0 @@ ---- -title: "Scatter/gather-operations in Snakemake" -subtitle: "Snakemake BYOC NBIS course" -date: "`r format(Sys.time(), '%d %B, %Y')`" -output: - xaringan::moon_reader: - self-contained: true - seal: false - css: ["default", "../template.css"] - nature: - slideNumberFormat: "" 
---- - -layout: true - - - ---- - -class: center, middle - -.HUGE[Scatter/gather-operations] -
-.HUGE[in Snakemake] - -```{r Setup, echo = FALSE, message = FALSE} -# Knitr setup -knitr::opts_chunk$set(message = FALSE, - warning = FALSE) - -# Load packages -library("dplyr") -library("kableExtra") -``` - ---- - -# What does scatter/gather mean? - --- - -* .green[Scatter]: turn input into several pieces of output - --- - -* .green[Gather]: bring together (aggregate) results from the different pieces - - --- - -Snakemake now has built-in support for scatter/gather processes via the `scattergather` directive. Described further in the documentation: [Defining scatter-gather processes](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#defining-scatter-gather-processes). Currently not very flexible though. - ---- - -# When are scatter-gather processes handy? - --- - -- demultiplexing sequencing runs - - - multiple samples per plate - - split plates into separate files per sample - --- - -- extract reads from bam files - - - reads mapped to several genomes - - split sequences per genome - --- - -- parallelize analyses - - - _e.g._ multiple sequences per sample - - split input into smaller chunks and run analyses in parallell - -_etc_... - --- - -Between scattering and gathering there's some type of analyses performed. - ---- - -# The basics - -```python -DATASETS = ["a", "b", "c"] - -rule scatter: - output: - expand('{dataset}.txt', dataset=DATASETS) - input: - data = 'data.tar.gz' - shell: - """ - tar xvf {input} - """ - -rule uppercase: - input: - "{dataset}.txt" - output: - "{dataset}.uppercase.txt" - shell: - """ - tr [a-z] [A-Z] < {input} > {output} - """ - -rule gather: - output: - "aggregated.txt" - input: - expand("{dataset}.uppercase.txt", dataset=DATASETS) - shell: - """ - cat {input} > {output} - """ -``` --- - -```bash -snakemake -c 1 - -Job stats: -job count min threads max threads ---------- ------- ------------- ------------- -gather 1 1 1 -scatter 1 1 1 -uppercase 3 1 1 -total 5 1 1 -``` - ---- - -# The basics - -.center[] - ---- - -# Example: split files for parallelization - --- -- one fastq file per sample -``` -data -├── sample1.fastq -└── sample2.fastq -``` --- -- split into several files (scatter) -``` -splits -├── sample1 -│ ├── sample1.1.fastq -│ ├── sample1.2.fastq -│ ├── sample1.3.fastq -| ├── sample1.4.fastq -│ └── sample1.5.fastq -├── sample2 -| ├── sample2.1.fastq -| ├── sample2.2.fastq -| ├── sample2.3.fastq -| ├── sample2.4.fastq -└ └── sample2.5.fastq -``` ---- -# Example: split files for parallelization - -- process individual files (parallelization) -``` -rc -├── sample1 -│ ├── sample1.1.rc.fastq -│ ├── sample1.2.rc.fastq -│ ├── sample1.3.rc.fastq -| ├── sample1.4.rc.fastq -│ └── sample1.5.rc.fastq -├── sample2 -| ├── sample2.1.rc.fastq -| ├── sample2.2.rc.fastq -| ├── sample2.3.rc.fastq -| ├── sample2.4.rc.fastq -└ └── sample2.5.rc.fastq -``` --- -- aggregate results (gather) -``` -sample1.rc.fastq -sample2.rc.fastq -``` - ---- -# Example: split files for parallelization - -We start with defining the number of splits -```python -splits = 5 -scatteritems = range(1, splits+1] -``` - --- -Then define a rule to scatter each sample fastq -```python -rule scatter: - output: - expand("splits/{{sample}}/{{sample}}.{scatteritem}.fastq", scatteritem = scatteritems) - input: - "data/{sample}.fastq" - log: - "logs/{sample}.scatter.log" - conda: - "envs/seqkit.yml" - params: - parts = splits, - outdir = lambda wildcards, output: os.path.dirname(output[0]) - shell: - """ - seqkit split -p {params.parts} -O {params.outdir} {input} > {log} 2>&1 - rename 
's/part_0*//' {params.outdir}/{wildcards.sample}.*.fastq - """ -``` - -Here `scatteritem` is not a wildcard because it is expanded using the `scatteritems` list. - ---- -# Example: split files for parallelization - -Next, a rule to do something with the split files per sample - -```python -rule reversecomplement: - output: - "rc/{sample}/{sample}.{scatteritem}.rc.fastq" - input: - "splits/{sample}/{sample}.{scatteritem}.fastq" - conda: - "envs/seqkit.yml" - shell: - """ - seqkit seq --reverse --complement {input} > {output} - """ -``` - -Here both `scatteritem` and `sample` are wildcards. The rule is generalized to work on any value for these wildcards. - ---- -# Example: split files for parallelization - -Then a rule to gather the results per sample - -```python -rule gather: - output: - "{sample}.rc.fastq" - input: - expand("rc/{{sample}}/{{sample}}.{scatteritem}.rc.fastq", scatteritem = scatteritems) - shell: - "cat {input} > {output}" -``` - -Here `scatteritem` is not a wildcard, but `sample` is. The rule can gather split files for any sample. - ---- -# Example: split files for parallelization - -Finally we put everything together, and define a pseudo rule 'all' that takes as input the gathered results for -all samples. - -```python -samples = ["sample1", "sample2"] - -splits = 5 -scatteritems = range(1, splits+1) - -rule all: - input: - expand("{sample}.rc.fastq", sample = samples) - -rule scatter: - output: - expand("splits/{{sample}}/{{sample}}.{scatteritem}.fastq", scatteritem = scatteritems) - input: - "data/{sample}.fastq" - -rule reversecomplement: - output: - "rc/{sample}/{sample}.{scatteritem}.rc.fastq" - input: - "splits/{sample}/{sample}.{scatteritem}.fastq" - -rule gather: - output: - "{sample}.rc.fastq" - input: - expand("rc/{{sample}}/{{sample}}.{scatteritem}.rc.fastq", scatteritem = scatteritems) -``` ---- -# Example: split files for parallelization - -```bash -snakemake -c 1 --use-conda - -Building DAG of jobs... -Job stats: -job count min threads max threads ------------------ ------- ------------- ------------- -all 1 1 1 -gather 2 1 1 -reversecomplement 10 1 1 -scatter 2 1 1 -total 15 1 1 -``` --- - -.center[] - --- - -This example workflow is available at the course GitHub repository: [workshop-snakemake-byoc/tree/main/lectures/scatter-gather/](https://github.com/NBISweden/workshop-snakemake-byoc/tree/master/lectures/example-workflow) - ---- - - -class: center, middle - -.HUGE[Questions?] 
diff --git a/lectures/scatter-gather/scatter-gather.html b/lectures/scatter-gather/scatter-gather.html index 42b3efe..d732291 100644 --- a/lectures/scatter-gather/scatter-gather.html +++ b/lectures/scatter-gather/scatter-gather.html @@ -1,425 +1,3387 @@ - - - Scatter/gather-operations in Snakemake - - - - - - - - - - - - - - + + - - + + + + + + + + + + + \ No newline at end of file diff --git a/lectures/scatter-gather/scatter-gather.pdf b/lectures/scatter-gather/scatter-gather.pdf deleted file mode 100644 index 68a923c..0000000 Binary files a/lectures/scatter-gather/scatter-gather.pdf and /dev/null differ diff --git a/lectures/scatter-gather/scatter-gather.qmd b/lectures/scatter-gather/scatter-gather.qmd new file mode 100644 index 0000000..3f8d5aa --- /dev/null +++ b/lectures/scatter-gather/scatter-gather.qmd @@ -0,0 +1,579 @@ +--- +title: "Scatter/gather-operations in Snakemake" +subtitle: "Snakemake BYOC NBIS course" +date: 2024-05-27 +format: + revealjs: + theme: + - white + - ../custom.scss + embed-resources: true + toc: false + toc-depth: 1 + slide-level: 2 + slide-number: true + #preview-links: true + #chalkboard: true + # Multiple logos not possible; would need to make custom logo combining both logos + footer: Snakemake BYOC 2024 - Reproducible Research + logo: https://nbis.se/nbislogo-green.svg + smaller: true + highlight-style: gruvbox +--- + + +```{r Setup, echo = FALSE, message = FALSE} +# Knitr setup +knitr::opts_chunk$set(message = FALSE, + warning = FALSE) + +# Load packages +library("dplyr") +library("kableExtra") +``` + +## What does scatter/gather mean? + +:::{.fragment} +**Scatter**: turn input into several pieces of output +::: + +:::{.fragment} +**Gather**: bring together (aggregate) results from the different pieces +::: + +:::{.fragment} + +Snakemake now has built-in support for scatter/gather processes via the `scattergather` directive. Described further in the documentation: [Defining scatter-gather processes](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#defining-scatter-gather-processes). Currently not very flexible though. +::: + +## When are scatter-gather processes handy? + +:::{.incremental} +- demultiplexing sequencing runs + + - multiple samples per plate + - split plates into separate files per sample + +- extract reads from bam files + + - reads mapped to several genomes + - split sequences per genome + +- parallelize analyses + + - split input into smaller chunks and run analyses in parallell + +::: + + +## A basic example + +```{.python code-line-numbers="|1|3-11|13-21|23-31"} +DATASETS = ["a", "b", "c"] + +rule scatter: + output: + expand('{dataset}.txt', dataset=DATASETS) + input: + data = 'data.tar.gz' + shell: + """ + tar xvf {input} + """ + +rule uppercase: + input: + "{dataset}.txt" + output: + "{dataset}.uppercase.txt" + shell: + """ + tr [a-z] [A-Z] < {input} > {output} + """ + +rule gather: + output: + "aggregated.txt" + input: + expand("{dataset}.uppercase.txt", dataset=DATASETS) + shell: + """ + cat {input} > {output} + """ +``` + +## Filegraph + +```{dot} +digraph snakemake_dag { + graph[bgcolor=white, margin=0]; + node[shape=box, style=rounded, fontname=sans, fontsize=10, penwidth=2]; + edge[penwidth=2, color=grey]; +0 [ shape=none, margin=0, label=< + +
+<b>gather</b><br/>
+↪ input<br/>
+a.uppercase.txt<br/>
+b.uppercase.txt<br/>
+c.uppercase.txt<br/>
+output →<br/>
+aggregated.txt
+>]
+1 [ shape=none, margin=0, label=<
+<b>uppercase</b><br/>
+↪ input<br/>
+{dataset}.txt<br/>
+output →<br/>
+{dataset}.uppercase.txt
+>]
+2 [ shape=none, margin=0, label=<
+<b>scatter</b><br/>
+↪ input<br/>
+data.tar.gz<br/>
+output →<br/>
+a.txt<br/>
+b.txt<br/>
+c.txt
>] + 1 -> 0 + 2 -> 1 +} +``` + +## Another example: splitting files for parallelization + +## Splitting files for parallelization {auto-animate="true" auto-animate-easing=None} + +::: {data-id="files"} +- one fastq file per sample +``` +data +├── sample1.fastq +└── sample2.fastq +``` +::: + +## Splitting files for parallelization {auto-animate="true" auto-animate-easing=None} + +- split into several files (scatter) + +::: {data-id="files"} +``` +splits +├── sample1 +│ ├── sample1.001.fastq +│ ├── sample1.002.fastq +│ ├── sample1.003.fastq +| ├── sample1.004.fastq +│ └── sample1.005.fastq +├── sample2 +| ├── sample2.001.fastq +| ├── sample2.002.fastq +| ├── sample2.003.fastq +| ├── sample2.004.fastq +└ └── sample2.005.fastq +``` +::: + +## Splitting files for parallelization {auto-animate="true" auto-animate-easing=None} + +- process individual files (parallelization) + +::: {data-id="files"} +``` +rc +├── sample1 +│ ├── sample1.001.rc.fastq +│ ├── sample1.002.rc.fastq +│ ├── sample1.003.rc.fastq +| ├── sample1.004.rc.fastq +│ └── sample1.005.rc.fastq +├── sample2 +| ├── sample2.001.rc.fastq +| ├── sample2.002.rc.fastq +| ├── sample2.003.rc.fastq +| ├── sample2.004.rc.fastq +└ └── sample2.005.rc.fastq +``` +::: + +## Splitting files for parallelization {auto-animate="true" auto-animate-easing=None} + +- aggregate results (gather) + +::: {data-id="files"} +``` +├── sample1.rc.fastq +└── sample2.rc.fastq +``` +::: + +## Splitting files for parallelization + +We start with defining the number of splits + +```{python, echo=TRUE} +splits = 5 +scatteritems = [f"{split:03d}" for split in list(range(1, splits+1))] +scatteritems +``` + +## Splitting files for parallelization + +We also impose some constraints on the wildcards: + +```{.python} +wildcard_constraints: + scatteritems = "\\d+", + sample = "\\w+" +``` + +Here, scatteritems can be any number of digits, and sample can be any number of word characters (`[a-zA-Z0-9_]`). + +## Splitting files for parallelization + +Then define a rule to scatter each sample fastq + +```{python code=readLines("Snakefile")[15:31]} +#| echo: true +#| eval: false +``` + +Here `scatteritem` is not a wildcard because it is expanded using the `scatteritems` list. + +## Splitting files for parallelization + +Next, a rule to do something with the split files per sample + +```{python code=readLines("Snakefile")[32:45]} +#| echo: true +#| eval: false +``` + +Here both `scatteritem` and `sample` are wildcards. The rule is generalized to work on any value for these wildcards. + +## Splitting files for parallelization + +Then a rule to gather the results per sample + +```{python code=readLines("Snakefile")[46:55]} +#| echo: true +#| eval: false +``` + +Here `scatteritem` is not a wildcard, but `sample` is. The rule can gather split files for any sample. + +## Splitting files for parallelization + +Finally we put everything together, and define a pseudo rule 'all' that takes as input the gathered results for +all samples. 
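+
+Since the individual rules on the previous slides are pulled in from the course `Snakefile` with `readLines`, a hand-written sketch of how the pieces fit together is shown below. This is an overview, not the verbatim course file: the rule name `rc` follows the job graph shown later, and the `shell`, `log` and `conda` directives are omitted for brevity (they appear on the previous slides).
+
+```{.python}
+samples = ["sample1", "sample2"]
+
+splits = 5
+scatteritems = [f"{split:03d}" for split in list(range(1, splits+1))]
+
+rule all:
+    input:
+        expand("{sample}.rc.fastq", sample = samples)
+
+rule scatter:
+    output:
+        expand("splits/{{sample}}/{{sample}}.{scatteritem}.fastq", scatteritem = scatteritems)
+    input:
+        "data/{sample}.fastq"
+
+rule rc:
+    output:
+        "rc/{sample}/{sample}.{scatteritem}.rc.fastq"
+    input:
+        "splits/{sample}/{sample}.{scatteritem}.fastq"
+
+rule gather:
+    output:
+        "{sample}.rc.fastq"
+    input:
+        expand("rc/{{sample}}/{{sample}}.{scatteritem}.rc.fastq", scatteritem = scatteritems)
+```
+
+The `rule all` part as it appears in the course `Snakefile`: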
+ +```{python code=readLines("Snakefile")[11:15]} +#| echo: true +#| eval: false +``` + +## + +```{dot} +digraph snakemake_dag { + graph[bgcolor=white, margin=0]; + node[shape=box, style=rounded, fontname=sans, fontsize=24, penwidth=4]; + edge[penwidth=2, color=grey]; + 0[label = "all", color = "0.00 0.6 0.85", style="rounded"]; + 1[label = "gather", color = "0.50 0.6 0.85", style="rounded"]; + 2[label = "rc\nscatteritem: 001", color = "0.33 0.6 0.85", style="rounded"]; + 3[label = "scatter\nsample: sample1", color = "0.17 0.6 0.85", style="rounded"]; + 4[label = "rc\nscatteritem: 002", color = "0.33 0.6 0.85", style="rounded"]; + 5[label = "rc\nscatteritem: 003", color = "0.33 0.6 0.85", style="rounded"]; + 6[label = "rc\nscatteritem: 004", color = "0.33 0.6 0.85", style="rounded"]; + 7[label = "rc\nscatteritem: 005", color = "0.33 0.6 0.85", style="rounded"]; + 8[label = "gather", color = "0.50 0.6 0.85", style="rounded"]; + 9[label = "rc\nscatteritem: 001", color = "0.33 0.6 0.85", style="rounded"]; + 10[label = "scatter\nsample: sample2", color = "0.17 0.6 0.85", style="rounded"]; + 11[label = "rc\nscatteritem: 002", color = "0.33 0.6 0.85", style="rounded"]; + 12[label = "rc\nscatteritem: 003", color = "0.33 0.6 0.85", style="rounded"]; + 13[label = "rc\nscatteritem: 004", color = "0.33 0.6 0.85", style="rounded"]; + 14[label = "rc\nscatteritem: 005", color = "0.33 0.6 0.85", style="rounded"]; + 1 -> 0 + 8 -> 0 + 2 -> 1 + 4 -> 1 + 5 -> 1 + 6 -> 1 + 7 -> 1 + 3 -> 2 + 3 -> 4 + 3 -> 5 + 3 -> 6 + 3 -> 7 + 9 -> 8 + 11 -> 8 + 12 -> 8 + 13 -> 8 + 14 -> 8 + 10 -> 9 + 10 -> 11 + 10 -> 12 + 10 -> 13 + 10 -> 14 +} +``` + +This example workflow is available at the course GitHub repository: [workshop-snakemake-byoc/tree/main/lectures/scatter-gather/Snakefile](https://github.com/NBISweden/workshop-snakemake-byoc/tree/main/lectures/scatter-gather/Snakefile) + +# Dynamic output + +# Data-dependent conditional execution {transition="slide" transition-speed="slow"} + +# Checkpoints {transition="slide" transition-speed="slow"} + +## Checkpoints + +If the output of a rule is not known in advance, Snakemake can re-evaluate the workflow using **checkpoints**. + +:::{.fragment} +Several use-cases, _e.g._ clustering into an unknown number of clusters. +::: + +:::{.fragment} +Let's try this with the previous example by implementing a random number of splits. +::: + +## Checkpoints {auto-animate="true"} + +Before: number of splits defined ahead of time + +```{python code=readLines("Snakefile")[0:15]} +#| echo: true +#| eval: false +``` + +## Checkpoints {auto-animate="true"} + +Now: Number of splits will be random + +```{python code=readLines("Snakefile_checkpoints")[0:14]} +#| echo: true +#| eval: false +``` + +## Checkpoints {auto-animate="true" auto-animate-restart="true"} + +Before: The scatter rule expanded the output files + +```{.python code-line-numbers="3,11"} +rule scatter: + output: + expand("splits/{{sample}}/{{sample}}.{scatteritem}.fastq", scatteritem = scatteritems) + input: + "data/{sample}.fastq" + log: + "logs/{sample}.scatter.log" + conda: + "envs/seqkit.yml" + params: + splits = splits, + outdir = lambda wildcards, output: os.path.dirname(output[0]) + shell: + """ + seqkit split --by-part-prefix {wildcards.sample}. 
-p {params.splits} -O {params.outdir} {input} > {log} 2>&1 + """ +``` + +## Checkpoints {auto-animate="true"} + +Now: The scatter rule becomes a checkpoint with unknown number of output files + +```{.python code-line-numbers="3,11"} +checkpoint scatter: + output: + directory("splits/{sample}") + input: + "data/{sample}.fastq" + log: + "logs/{sample}.scatter.log" + conda: + "envs/seqkit.yml" + params: + splits = random.randint(1,10) + shell: + """ + seqkit split --by-part-prefix {wildcards.sample}. -p {params.splits} -O {output} {input} > {log} 2>&1 + """ +``` + +## Checkpoints + +The `rc` rule is left unchanged + +```{python code=readLines("Snakefile")[32:45]} +#| echo: true +#| eval: false +``` + +## Checkpoints {auto-animate="true"} + +Before: The gather rule expanded the input files + +```{.python code-line-numbers="5"} +rule gather: + output: + "{sample}.rc.fastq" + input: + expand("rc/{{sample}}/{{sample}}.{scatteritem}.rc.fastq", scatteritem = scatteritems) + shell: + """ + cat {input} > {output} + """ +``` + +## Checkpoints {auto-animate="true"} + +Now: we use an input function and the built-in `glob_wildcards` + +```{.python code-line-numbers="1-7,13"} +def aggregate_input(wildcards): + checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = glob_wildcards(os.path.join(checkpoint_outdir,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input + +rule gather: + output: + "{sample}.rc.fastq" + input: + aggregate_input + shell: + """ + cat {input} > {output} + """ +``` + +## Checkpoints {auto-animate="true"} + +```{.python code-line-numbers="2"} +def aggregate_input(wildcards): + checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = glob_wildcards(os.path.join(checkpoint_outdir,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input +``` + +- Get the output directory of the scatter checkpoint for the sample (`checkpoint_outdir='splits/sample1'`) + +## Checkpoints {auto-animate="true"} + +```{.python code-line-numbers="3" code=} +def aggregate_input(wildcards): + checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = glob_wildcards(os.path.join(checkpoint_outdir,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input +``` + +- Use `glob_wildcards` to infer the scatteritem wildcard based on existing files +- If `splits=3`, `scatteritems=["001", "002", "003"]` + +## Checkpoints {auto-animate="true"} + +```{.python code-line-numbers="4"} +def aggregate_input(wildcards): + checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = glob_wildcards(os.path.join(checkpoint_outdir,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input +``` +- The sample wildcard is known (`sample='sample1'`) + +## Checkpoints {auto-animate="true"} + +```{.python code-line-numbers="6"} +def aggregate_input(wildcards): + checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = 
glob_wildcards(os.path.join(checkpoint_outdir,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input +``` + +- `scatteritem` is expanded using the inferred `scatteritems` list + +## Checkpoints {auto-animate="true" .code.sstree} + +```{.python code-line-numbers="7"} +def aggregate_input(wildcards): + checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0] + scatteritems = glob_wildcards(os.path.join(checkpoint_outdir,"{sample}.{scatteritem}.fastq")).scatteritem + input = expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq", + sample=wildcards.sample, + scatteritem=scatteritems) + return input +``` + +- returned input becomes: + +```{.python code-line-numbers="false"} +["rc/sample1/sample1.001.rc.fastq", +"rc/sample1/sample1.002.rc.fastq", +"rc/sample1/sample1.003.rc.fastq"] +``` + +## + +```{dot} +digraph snakemake_dag { + graph[bgcolor=white, margin=0]; + node[shape=box, style=rounded, fontname=sans, fontsize=10, penwidth=2]; + edge[penwidth=2, color=grey]; + 0[label = "all", color = "0.33 0.6 0.85", style="rounded"]; + 1[label = "gather", color = "0.50 0.6 0.85", style="rounded"]; + 2[label = "scatter\nsample: sample1", color = "0.00 0.6 0.85", style="rounded"]; + 3[label = "gather", color = "0.50 0.6 0.85", style="rounded"]; + 4[label = "scatter\nsample: sample2", color = "0.00 0.6 0.85", style="rounded"]; + 1 -> 0 + 3 -> 0 + 2 -> 1 + 4 -> 3 +} +``` + +This example workflow is available at the course GitHub repository: [workshop-snakemake-byoc/tree/main/lectures/scatter-gather/Snakefile_checkpoints](https://github.com/NBISweden/workshop-snakemake-byoc/tree/main/lectures/scatter-gather/Snakefile_checkpoints) + +## Questions? \ No newline at end of file diff --git a/lectures/welcome/john-sundh.jpg b/lectures/welcome/john-sundh.jpg index bc047ab..519eb45 100644 Binary files a/lectures/welcome/john-sundh.jpg and b/lectures/welcome/john-sundh.jpg differ
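For readers who want to run the data-dependent (checkpoint) variant end to end, the pieces discussed in `lectures/scatter-gather/scatter-gather.qmd` above can be assembled into one minimal Snakefile. The sketch below is illustrative, not the course's `Snakefile_checkpoints`: the sample list, the rule name `rc`, and having `seqkit` on `PATH` (instead of the `envs/seqkit.yml` conda environment) are assumptions.

```python
# Minimal sketch of the checkpoint-based scatter/gather workflow (assumptions:
# inputs in data/{sample}.fastq, seqkit on PATH, reverse-complement rule named "rc").
import os
import random

samples = ["sample1", "sample2"]

wildcard_constraints:
    scatteritem = "\\d+",
    sample = "\\w+"

rule all:
    input:
        expand("{sample}.rc.fastq", sample=samples)

# Scatter is a checkpoint: the number of splits is only known after it has run.
checkpoint scatter:
    output:
        directory("splits/{sample}")
    input:
        "data/{sample}.fastq"
    log:
        "logs/{sample}.scatter.log"
    params:
        splits = random.randint(1, 10)
    shell:
        """
        seqkit split --by-part-prefix {wildcards.sample}. -p {params.splits} -O {output} {input} > {log} 2>&1
        """

# Reverse-complement one split; unchanged from the non-checkpoint version.
rule rc:
    output:
        "rc/{sample}/{sample}.{scatteritem}.rc.fastq"
    input:
        "splits/{sample}/{sample}.{scatteritem}.fastq"
    shell:
        """
        seqkit seq --reverse --complement {input} > {output}
        """

def aggregate_input(wildcards):
    # Wait for the checkpoint, then infer scatteritems from the files it produced
    checkpoint_outdir = checkpoints.scatter.get(sample=wildcards.sample).output[0]
    scatteritems = glob_wildcards(
        os.path.join(checkpoint_outdir, "{sample}.{scatteritem}.fastq")).scatteritem
    return expand("rc/{sample}/{sample}.{scatteritem}.rc.fastq",
                  sample=wildcards.sample,
                  scatteritem=scatteritems)

rule gather:
    output:
        "{sample}.rc.fastq"
    input:
        aggregate_input
    shell:
        """
        cat {input} > {output}
        """
```

Running `snakemake -c 4` on this sketch first executes the `scatter` checkpoint for each sample, re-evaluates the DAG, and only then schedules the `rc` and `gather` jobs for the splits that actually exist.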