Module 2: Processes and Software Dependencies

Learning Objectives

  1. Work on a practical example of a Nextflow pipeline
  2. Understand process inputs and outputs
  3. Understand process directives
  4. Use the module and container directives to specify process dependencies
  5. Understand Nextflow configuration

2.1 Pipeline Processes

  • Open and take a look.

  • Here we have a single process, INDEX, defined. This process builds a Salmon index from the transcriptome provided.

    process INDEX {
        input:
        path transcriptome
        output:
        path 'salmon_index'
        script:
        """
        salmon index -t $transcriptome -i salmon_index
        """
    }
  • We can see this process defines an input of type path (i.e. a file)

  • In the script section, the variable $transcriptome will evaluate to the name of the input file

  • The output is also of type path, and declares that a file or directory named 'salmon_index' should be created (salmon index writes its index as a directory)

  • If the output file does not exist after the process has run, Nextflow will throw an error

Exercise 2.1

  1. Run
    nextflow run ~/wehi-nextflow-training/module_2/
    This will fail with the error message: line 2: salmon: command not found. This is because we haven't specified the software the process requires.
  2. Check whether a salmon module is available on Milton
    module avail salmon
    We can see that salmon is indeed available.
  3. Add the directive module 'salmon/1.9.0' to the INDEX process as follows and run the pipeline again.
    process INDEX {
        module 'salmon/1.9.0'
        input:
        path transcriptome
        output:
        path 'salmon_index'
        script:
        """
        salmon index -t $transcriptome -i salmon_index
        """
    }
    With this, Nextflow will load the appropriate module prior to running the process script.

2.2 Process Directives

  • Directives specify the execution environment of a Nextflow process
  • For example, the module directive above specifies the software modules to load
  • Directives are placed at the top of a process definition
  • See the Nextflow documentation for all available directives

Exercise 2.2

  1. Add the following directives to INDEX to specify the CPU and memory resources required by the process.

    memory '2 GB'
    cpus 1
  2. Run the workflow again and confirm it runs successfully.

    process INDEX {
        module 'salmon/1.9.0'
        memory '2 GB'
        cpus 1
        input:
        path transcriptome
        output:
        path 'salmon_index'
        script:
        """
        salmon index -t $transcriptome -i salmon_index
        """
    }

2.3 Process inputs and outputs

  • Open and take a look.
  • Here we have added a process QUANTIFICATION. This process takes the RNA-seq data and counts the reads originating from each transcript in the transcriptome:
    process QUANTIFICATION {
        module 'salmon/1.9.0'
        memory '2 GB'
        cpus 2
        tag "$sample_id"
        input:
        path salmon_index
        tuple val(sample_id), path(reads)
        output:
        path output
        script:
        // The output file name is derived from the sample ID
        output = "${sample_id}.sf"
        """
        salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o out
        mv out/quant.sf $output
        """
    }
  • Process inputs and outputs are declared with one of the following qualifiers:
    • val - A regular Groovy value. It could be a String, Integer, Boolean, Double, etc.
    • path - An input or output file.
    • stdout - A special output type that captures the standard output of the process script
    • tuple - A collection of inputs or outputs grouped together. The elements may be of type val, path or stdout
  • Note that QUANTIFICATION defines two input channels, one for the Salmon index and one for the reads of each sample
  • Channel.fromFilePairs() is a channel factory designed to handle paired file inputs, such as paired-end reads from NGS experiments
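
As a rough illustration of how the two input channels might be wired together, the sketch below builds a channel of read pairs with Channel.fromFilePairs() and passes it to QUANTIFICATION alongside the index. The parameter names params.transcriptome and params.reads and the file paths are assumptions made for this example; the training repository may use different ones.

```groovy
// Hypothetical parameters: names and paths are assumptions for illustration only.
params.transcriptome = "$projectDir/data/transcriptome.fa"
params.reads         = "$projectDir/data/*_{1,2}.fq"

workflow {
    // Emits one tuple per sample: [ sample_id, [ read_1, read_2 ] ],
    // which matches the input declaration: tuple val(sample_id), path(reads)
    read_pairs_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)

    // First input channel: the Salmon index built by INDEX
    index_ch = INDEX(file(params.transcriptome))

    // Inputs are passed in the order they are declared in the process
    quant_ch = QUANTIFICATION(index_ch, read_pairs_ch)
}
```

The sample ID emitted by fromFilePairs() is the part of the file name matched by the * wildcard, which is what the tag directive and the output file name in QUANTIFICATION rely on.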

2.4 Software Containers

  • One of the most powerful features of Nextflow is its support for software containers (Docker, Singularity, etc.).
  • Using containers will improve the reproducibility and portability of your pipelines.
  • Containers can be specified using the container directive.
  • You can find pre-made containers for popular bioinformatics software through Bioconda

Exercise 2.4

  1. Visit Here we see Salmon is available in a Docker container at "". If we visit the "salmon/tags" link we can find that the latest available tag is "1.9.0--h7e5ed60_1"
  2. Replace the directive module 'salmon/1.9.0' with container '' in the processes INDEX and QUANTIFICATION in
  3. Run
    nextflow run ~/wehi-nextflow-training/module_2/

2.5 Configuration

  • Look at ~/.nextflow/config. This provides system-wide Nextflow configuration, and is tailored to Milton (it was created when you first loaded the nextflow module).
    process {
        executor = 'slurm'
        cache = 'lenient'
    }
    executor {
        name = 'slurm'
        queueSize = 100
        queueStatInterval = '10 sec'
        pollInterval = '10 sec'
        submitRateLimit = '10sec'
    }
    singularity {
        enabled = true
        autoMounts = true
        runOptions = '-B /vast -B /stornext -B /wehisan'
    }
    docker.enabled = false
  • When a file named nextflow.config is present in the same directory as a Nextflow script, it provides project-level configuration to be used when running that script.
  • When a setting appears in both the system-wide ~/.nextflow/config and the project nextflow.config, the value in the project nextflow.config takes precedence
  • Open nextflow.config and take a look at the settings. The default 'slurm' executor is overridden to use the 'local' executor.
  • See the Nextflow configuration documentation for more details
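
To make the override concrete, a project-level nextflow.config that only switches the executor could look roughly like the fragment below. This is a minimal sketch based on the description above, not a copy of the file in the training repository, which may contain additional settings.

```groovy
// Project-level nextflow.config (sketch, assumed contents)
process {
    // Takes precedence over executor = 'slurm' in ~/.nextflow/config
    executor = 'local'
}
```

Settings that the project file does not mention, such as cache = 'lenient' and the singularity block, would still be taken from ~/.nextflow/config.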

2.6 Publishing Outputs

  • Open and take a look.
  • Here we have added a process PLOT_TPM. This process is an R script that takes the RNA-seq quantification results and creates a plot:
    process PLOT_TPM {
        container 'rocker/tidyverse:4.1.3'
        memory '2 GB'
        cpus 1
        publishDir "results", mode: 'copy'
        input:
        path quant_results
        output:
        path 'TPM.png'
        script:
        """
        #!/usr/bin/env Rscript
        library(tidyverse)
        # Read every .sf file in the task directory and plot TPM per transcript
        p <- data.frame(filename = list.files(pattern='.sf')) %>%
            mutate(tissue = str_remove(basename(filename), '.sf')) %>%
            mutate(data = map(filename, read_tsv, col_types = cols())) %>%
            unnest(data) %>%
            select(tissue, transcript = Name, TPM) %>%
            ggplot(aes(transcript, TPM, fill = tissue)) +
            geom_col(position = 'dodge')
        ggsave('TPM.png', plot = p, width = 6, height = 4)
        """
    }
  • The publishDir directive specifies that output files from this process should be copied to the folder 'results'
  • The operator collect() is used to combine all the outputs in quant_ch into a single input for PLOT_TPM
    quant_ch = QUANTIFICATION(index_ch, read_pairs_ch)
    PLOT_TPM(quant_ch.collect())

Exercise 2.6

  1. Open nextflow.config and change process.executor from 'local' to 'slurm'. This directs Nextflow to submit jobs to the Slurm queue.
  2. Run the pipeline again and observe the output
    nextflow run ~/wehi-nextflow-training/module_2/ -resume

2.7 Execution Log

Exercise 2.7

  1. Run nextflow log to list all previous executions, and note the RUN NAME of the most recent execution
  2. Run nextflow log <RUN NAME> -f name,workdir,native_id,status,exit replacing <RUN NAME> with the name from step 1. This will list all jobs run in that execution.