diff --git a/lectures/anatomy-of-a-rule/anatomy.html b/lectures/anatomy-of-a-rule/anatomy.html
index 4bdadcc..89ef961 100644
--- a/lectures/anatomy-of-a-rule/anatomy.html
+++ b/lectures/anatomy-of-a-rule/anatomy.html
@@ -1,652 +1,2910 @@
- -
-Snakemake BYOC NBIS course
+ +2024-05-27
+$ snakemake -c 1
+Assuming unrestricted shared filesystem usage.
+Building DAG of jobs...
+Using shell: /bin/bash
+Provided cores: 1 (use --cores to define parallelism)
+Rules claiming more threads will be scaled down.
+Job stats:
+job count
+----- -------
+1 1
+total 1
+
+Select jobs to execute...
+Execute 1 jobs...
+
+[Fri May 17 23:47:24 2024]
+localrule 1:
+ output: results/sample1.stats.txt
+ jobid: 0
+ reason: Missing output files: results/sample1.stats.txt
+ resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T
+
+[Fri May 17 23:47:24 2024]
+Finished job 0.
+1 of 1 steps (100%) done
+Complete log: .snakemake/log/2024-05-17T234724.252920.snakemake.log
More commonly, rules are named and have both input and output:
+ +Rules are linked by their input and output files:
+ +You can also link rules explicitly:
+rule a:
+ output: "a.txt"
+ shell:
+ "echo 'a' > a.txt"
+
+rule b:
+ input: rules.a.output
+ output: "b.txt"
+ shell:
+ "cat {input} > {output}"
but then the rule that supplies the file must be defined before the rule that uses it.
+Wildcards generalize a workflow. Imagine you have not just sample1 but samples 1..100.
+Instead of writing 100 rules…
+ +…we can introduce one or more wildcards
which Snakemake can match to several text strings using regular expressions.
In our example, we replace the actual sample IDs with the wildcard sample:
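As a rough illustration of how a wildcard pattern can be matched against concrete file names, the sketch below turns an output pattern into a regular expression with one named group per wildcard. It is a simplified stand-in written for this example, not Snakemake's actual implementation: the helper names pattern_to_regex and match_wildcards are hypothetical, each wildcard is assumed to match any non-empty string (.+), and a wildcard name may appear only once per pattern.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern such as 'results/{sample}.stats.txt' to a regex.

    Hypothetical helper: every {name} becomes a named group matching one or
    more characters; everything else is matched literally.
    """
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard -> named group
    return re.compile(regex + "$")


def match_wildcards(pattern, path):
    """Return a dict of wildcard values if path fits pattern, else None."""
    match = pattern_to_regex(pattern).match(path)
    return match.groupdict() if match else None


if __name__ == "__main__":
    pattern = "results/{sample}.stats.txt"
    for path in ["results/sample1.stats.txt",
                 "results/sample42.stats.txt",
                 "logs/sample1.txt"]:
        print(path, "->", match_wildcards(pattern, path))
```

Paths that do not fit the pattern simply yield None, which mirrors the idea that a rule is only considered for targets its output pattern can produce.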
Rules can have multiple wildcards…
+ +…but all the wildcards must be present in the output section.
+Will work:
+ +Ambiguities can arise when two rules produce the same output:
+rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ shell:
+ """
+ samtools flagstat {input} > {output}
+ """
+
+rule print_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.log"
+ shell:
+ """
+ grep "% alignment" {input} > {output}
+ """
+
+rule make_report:
+ output: "results/{sample}.report.pdf"
+ input: "results/{sample}.stats.txt"
This can be handled in a number of ways:
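One of those ways in Snakemake is the ruleorder directive (for example, ruleorder: generate_stats > print_stats), which tells Snakemake which rule to prefer. The Python sketch below only illustrates the idea of detecting an ambiguous target and resolving it with a priority list. The function names and the simplified pattern matching are assumptions made for this example, not Snakemake internals.

```python
import re

# Output patterns of the three rules from the example above.
RULE_OUTPUTS = {
    "generate_stats": "results/{sample}.stats.txt",
    "print_stats": "results/{sample}.stats.txt",
    "make_report": "results/{sample}.report.pdf",
}


def to_regex(pattern):
    """Replace each {wildcard} with '.+' and match the rest literally."""
    literal_parts = re.split(r"\{\w+\}", pattern)
    return ".+".join(re.escape(part) for part in literal_parts)


def producers(target, rule_outputs):
    """List the rules whose output pattern can produce target."""
    return [name for name, pattern in rule_outputs.items()
            if re.fullmatch(to_regex(pattern), target)]


def choose_rule(target, rule_outputs, rule_order):
    """Pick a producing rule, using rule_order to break ties (hypothetical)."""
    candidates = producers(target, rule_outputs)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # More than one rule fits: prefer the one listed first in rule_order.
    for name in rule_order:
        if name in candidates:
            return name
    raise ValueError(f"Ambiguous rules for {target}: {candidates}")


if __name__ == "__main__":
    target = "results/sample1.stats.txt"
    print(producers(target, RULE_OUTPUTS))
    print(choose_rule(target, RULE_OUTPUTS, ["generate_stats", "print_stats"]))
```

Without an ordering, the sketch raises an error for the ambiguous target, which is the situation the slide describes.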
+Logfiles and messages add descriptions and help with debugging:
+rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ shell:
+ """
+ samtools flagstat {input} > {output} 2>{log}
+ """
Tip
+Unlike output files, log files are not deleted by Snakemake when a job fails.
+Compute resources can be set with threads and resources:
+rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: 4
+ resources:
+ mem_mb=100
+ shell:
+ """
+ samtools flagstat --threads {threads} {input} > {output} 2>{log}
+ """
It’s also possible to set threads based on the cores given to Snakemake (e.g. --cores 8 or -c 8).
rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: workflow.cores * 0.5
+ resources:
+ mem_mb=100
+ shell:
+ """
+ samtools flagstat --threads {threads} {input} > {output} 2>{log}
+ """
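To see what this means in numbers, the sketch below mimics the arithmetic: the requested thread count is derived from the available cores and then capped at that number, in line with the log message "Rules claiming more threads will be scaled down". The function name and the rounding to a whole number of at least 1 are assumptions of this example, not a statement about Snakemake's exact behaviour.

```python
def effective_threads(requested, cores):
    """Cap a requested thread count at the cores given on the command line.

    Assumption for this sketch: fractional values are truncated and at
    least one thread is always used.
    """
    return max(1, min(int(requested), cores))


if __name__ == "__main__":
    for cores in (1, 4, 8):
        requested = cores * 0.5  # same expression as in the rule above
        print(f"--cores {cores}: threads = {effective_threads(requested, cores)}")
```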
Resources can also be callables, allowing them to be set dynamically:
+rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: workflow.cores * 0.5
+ resources:
+ mem_mb=lambda wildcards: 1000 if wildcards.sample == "sample1-large" else 100
+ shell:
+ """
+ samtools flagstat --threads {threads} {input} > {output} 2>{log}
+ """
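The callable can be tried out in plain Python. In the sketch below, types.SimpleNamespace is only a stand-in for the wildcards object Snakemake passes in, and mem_mb_retry is a hypothetical variant that scales memory with the attempt number, assuming the workflow is run with retries enabled.

```python
from types import SimpleNamespace


def mem_mb(wildcards):
    """Same logic as the lambda in the rule above."""
    return 1000 if wildcards.sample == "sample1-large" else 100


def mem_mb_retry(wildcards, attempt):
    """Hypothetical variant: request more memory on every retry."""
    return 100 * attempt


if __name__ == "__main__":
    for sample in ("sample1", "sample1-large"):
        wildcards = SimpleNamespace(sample=sample)  # stand-in for Snakemake's object
        print(sample, mem_mb(wildcards))
    print("third attempt:", mem_mb_retry(SimpleNamespace(sample="sample1"), 3))
```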
Non-file rule parameters can be set with the params directive:
+rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: workflow.cores * 0.5
+ resources:
+ mem_mb=100
+ params:
+ verbosity = 2
+ shell:
+ """
+ samtools flagstat --verbosity {params.verbosity} \
+ --threads {threads} {input} > {output} 2>{log}
+ """
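The placeholders in the shell block are filled in much like Python's str.format fills a template. The sketch below reproduces that substitution for the command above. It is an approximation for illustration only: the render function is hypothetical, and SimpleNamespace stands in for the params object.

```python
from types import SimpleNamespace

# The shell command from the rule above, written as one line.
SHELL_TEMPLATE = (
    "samtools flagstat --verbosity {params.verbosity} "
    "--threads {threads} {input} > {output} 2>{log}"
)


def render(sample, threads, verbosity):
    """Fill the template the way the rule would for one sample (sketch)."""
    return SHELL_TEMPLATE.format(
        params=SimpleNamespace(verbosity=verbosity),
        threads=threads,
        input=f"results/{sample}.bam",
        output=f"results/{sample}.stats.txt",
        log=f"results/{sample}.flagstat.log",
    )


if __name__ == "__main__":
    print(render("sample1", threads=4, verbosity=2))
```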
Software environments can be set for each rule using the conda: directive:
rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: workflow.cores * 0.5
+ resources:
+ mem_mb=100
+ params:
+ verbosity = 2
+ conda: "envs/samtools.yml"
+ shell:
+ """
+ samtools flagstat --verbosity {params.verbosity} \
+ --threads {threads} {input} > {output} 2>{log}
+ """
Contents of envs/samtools.yml
To make Snakemake use the conda environment, specify --software-deployment-method conda (or --sdm conda) on the command line. For Snakemake versions before 8.0, use --use-conda.
On compute clusters, you can also specify packages to load with envmodules:
rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: workflow.cores * 0.5
+ resources:
+ mem_mb=100
+ params:
+ verbosity = 2
+ conda: "envs/samtools.yml"
+ envmodules:
+ "bioinfo-tools",
+ "samtools"
+ shell:
+ """
+ samtools flagstat --verbosity {params.verbosity} \
+ --threads {threads} {input} > {output} 2>{log}
+ """
To make Snakemake use envmodules, specify --use-envmodules on the command line.
Config files allow you to configure workflows without having to change the underlying code.
+ +Specify one or more config files on the command line with:
+ + +The config parameters are available as a dictionary inside your snakefiles and can be accessed from within rules:
+rule all:
+ input:
+ expand("results/{sample}.stats.txt", sample = config["samples"])
+
+rule generate_stats:
+ output: "results/{sample}.stats.txt"
+ input: "results/{sample}.bam"
+ log: "results/{sample}.flagstat.log"
+ message: "Generating stats for sample {wildcards.sample}"
+ threads: workflow.cores * 0.5
+ resources:
+ mem_mb=100
+ params:
+ verbosity = config["verbosity"]
+ conda: "envs/samtools.yml"
+ envmodules:
+ "bioinfo-tools",
+ "samtools"
+ shell:
+ """
+ samtools flagstat --verbosity {params.verbosity} \
+ --threads {threads} {input} > {output} 2>{log}
+ """
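To make the role of the config dictionary concrete, the sketch below builds the target list that rule all asks for. The config values are hypothetical (the contents of the config file are not shown here), and expand_targets is a simplified stand-in for what expand() does with a single wildcard.

```python
# Hypothetical config, as it might look after loading a config file
# that defines the keys "samples" and "verbosity".
config = {
    "samples": ["sample1", "sample2"],
    "verbosity": 2,
}


def expand_targets(pattern, samples):
    """Fill the {sample} placeholder once per sample (simplified expand)."""
    return [pattern.format(sample=sample) for sample in samples]


if __name__ == "__main__":
    for target in expand_targets("results/{sample}.stats.txt", config["samples"]):
        print(target)
    print("verbosity passed to params:", config["verbosity"])
```

Adding a sample to the config then changes the list of requested files without touching any rule.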
Snakemake is constantly being updated with new features. Check out the documentation, and specifically the section about writing rules.
+Snakemake BYOC 2024 - Reproducible Research
+