Advanced parameters features: subworkflows, environments and parameter nesting

Wise, Aaron edited this page Apr 9, 2018 · 6 revisions

Subworkflows

ZIPPY supports workflows that call into other workflows through the use of 'subworkflows'.

A valid subworkflow file is a standard ZIPPY json file that may contain only:

  1. Wildcards
  2. Stages
  3. Dockers

That means that all global-level variables required by stages in the subworkflow must currently be defined in the top-level workflow params file. (This is something we hope to improve some day.)

Wildcards, stages and dockers may not conflict by identifier/name with entities at the top level or in other subworkflows. Wildcards will only substitute at the same level (i.e., in the same file) as they are defined.

Dockers defined in a subworkflow may be used anywhere (which makes life easy for us technically), but it is generally considered bad practice to bubble up a docker from a subworkflow to a higher-level workflow.

A subworkflow is defined in the 'stages' section of a ZIPPY workflow. Here is an example:

        {
            "subworkflow": "path/to/subworkflow.json",
            "identifier": "_sub",
            "previous_stage": {"bwa_sub": "bcl2fastq"}
        }

The 'subworkflow' defines the path to the subworkflow json file. The 'identifier' is a string which is suffixed to the identifier of every stage in the subworkflow. (In our example here, the subworkflow has a stage with identifier 'bwa'. Once we define the subworkflow identifier '_sub', the new canonical identifier for that stage becomes 'bwa_sub'.) The 'previous_stage' defines a map from subworkflow stages (by their suffixed identifiers) to their top-level dependencies.
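The renaming and rewiring described above can be sketched in a few lines of Python. This is an illustration only, not ZIPPY's actual loader: the function name expand_subworkflow, the second stage 'post' and its stage type are hypothetical, and the sketch assumes that 'previous_stage' inside the subworkflow is a single string and that dependencies internal to the subworkflow receive the same suffix.

```python
import copy


def expand_subworkflow(entry, subworkflow_stages):
    """Return the subworkflow's stages, renamed and wired to top-level stages."""
    suffix = entry["identifier"]
    top_level_deps = entry.get("previous_stage", {})
    expanded = []
    for stage in subworkflow_stages:
        stage = copy.deepcopy(stage)  # leave the caller's data untouched
        stage["identifier"] += suffix  # e.g. 'bwa' becomes 'bwa_sub'
        # Assumption: a dependency inside the subworkflow gets the same suffix.
        if isinstance(stage.get("previous_stage"), str):
            stage["previous_stage"] += suffix
        # The map is keyed by the suffixed identifier, as in the example above.
        if stage["identifier"] in top_level_deps:
            stage["previous_stage"] = top_level_deps[stage["identifier"]]
        expanded.append(stage)
    return expanded


entry = {
    "subworkflow": "path/to/subworkflow.json",
    "identifier": "_sub",
    "previous_stage": {"bwa_sub": "bcl2fastq"},
}
# Hypothetical contents of the subworkflow's 'stages' section.
sub_stages = [
    {"identifier": "bwa", "stage": "bwa"},
    {"identifier": "post", "stage": "hypothetical_stage", "previous_stage": "bwa"},
]
expanded = expand_subworkflow(entry, sub_stages)
```

After expansion, the first stage is 'bwa_sub' and depends on the top-level 'bcl2fastq' stage, while the second is 'post_sub' and depends on 'bwa_sub'.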

Depending on subworkflows

Currently, subworkflow stages can be called directly by identifier. This breaks encapsulation, so in the future, we will likely require the subworkflow definition to explicitly expose the stages that downstream stages can depend on. To be forwards compatible, you can add a list 'output_stages' to your subworkflow definition, which lists the identifiers (with suffix) of the stages you would like to expose.
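If you want to check today that a workflow already respects this rule, a small lint along the following lines may help. It is a sketch under assumptions: the function name hidden_dependencies and the example stages are hypothetical, 'previous_stage' is assumed to be a single string, and the set of subworkflow stage identifiers and the 'output_stages' list are passed in directly rather than read from ZIPPY's files.

```python
def hidden_dependencies(stages, subworkflow_stage_ids, output_stages):
    """List (stage, dependency) pairs that bypass the exposed output stages."""
    hidden = []
    for stage in stages:
        dep = stage.get("previous_stage")
        # Only plain string dependencies are checked in this sketch.
        if not isinstance(dep, str):
            continue
        if dep in subworkflow_stage_ids and dep not in output_stages:
            hidden.append((stage["identifier"], dep))
    return hidden


# Hypothetical top-level stages that depend on subworkflow stages.
stages = [
    {"identifier": "report", "stage": "hypothetical_report", "previous_stage": "bwa_sub"},
    {"identifier": "qc", "stage": "hypothetical_qc", "previous_stage": "post_sub"},
]
violations = hidden_dependencies(
    stages,
    subworkflow_stage_ids={"bwa_sub", "post_sub"},
    output_stages=["post_sub"],
)
```

Here 'report' depends on 'bwa_sub', which is not exposed, so it is the only pair reported.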

Environments

An 'environment' in ZIPPY is a json params file that supplies values for parameters which are not defined elsewhere. Either a stage or the top level of the workflow can set 'environment' to the path of such a file. The environment is a simple json file containing ONLY a top-level map.

Nesting

ZIPPY global parameters can be defined at several layers of the hierarchy. For example, let's say you want to test alignment to two different genomes in the same workflow. Typically, ZIPPY defines the genome folder as self.params.genome, and it's assumed to be the same for every stage. However, you can break this assumption by redefining the genome in specific stages where needed. Here's an example of stages that would run bwa twice, once with the 'default' genome path, and once with an overridden genome path:

"stages": [...
        {
            "identifier": "alignment1", 
            "output_dir": "output1", 
            "previous_stage": "bcl2fastq", 
            "stage": "bwa"
        }, 
        {
            "identifier": "alignment2", 
            "output_dir": "output2", 
            "previous_stage": "bcl2fastq", 
            "stage": "bwa",
            "genome": "path/to/alt/genome"
        }
...]

In more detail, here is the complete lookup priority for the different parameter types. Lower-numbered sources override higher-numbered ones:

self.params.self.x

  1. defined in the stage
  2. defined in the stage environment json file
  3. Optional default value (only if the parameter is optional, and a default value has been defined)

self.params.x

  1. defined in the stage
  2. defined in the stage environment
  3. defined in the main json file
  4. defined in the top level environment json file
  5. Optional default value (only if the parameter is optional, and a default value has been defined)
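The two priority lists can be read as "first source that defines the name wins". The following Python sketch illustrates that reading; it is not ZIPPY's implementation. The lookup helper, the default genome path and the 'threads' and 'scratch_path' parameters are hypothetical.

```python
_MISSING = object()  # sentinel, so that None remains a legal default value


def lookup(name, layers, default=_MISSING):
    """Return the value of `name` from the first layer that defines it."""
    for layer in layers:
        if name in layer:
            return layer[name]
    if default is not _MISSING:
        return default
    raise KeyError(name)


stage = {"identifier": "alignment2", "genome": "path/to/alt/genome"}
stage_env = {}  # contents of the stage environment json file
main = {"genome": "path/to/default/genome"}  # hypothetical default path
top_env = {"scratch_path": "/hypothetical/scratch"}  # top-level environment

# self.params.self.x consults only the stage and its environment.
self_layers = [stage, stage_env]
# self.params.x additionally falls back to the main file and its environment.
global_layers = [stage, stage_env, main, top_env]

genome = lookup("genome", global_layers)
scratch = lookup("scratch_path", global_layers)
threads = lookup("threads", self_layers, default=4)
```

For 'alignment2' the stage-level genome wins over the one in the main json file. A stage without its own 'genome' entry, such as 'alignment1', would resolve to the value from the main file instead.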

Notes for Developers

Here are some implementation details of how these features work.

At json load time:

  • the subworkflows will be expanded and added natively to the workflow
  • environment variables will be added to the params object if their corresponding value is not already defined
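The "add only if not already defined" rule for environments amounts to a non-overwriting merge. A minimal Python sketch, assuming both the params and the environment are plain dictionaries (the function name merge_environment and the 'threads' parameter are hypothetical):

```python
def merge_environment(params, environment):
    """Copy environment entries into params without overwriting existing keys."""
    for key, value in environment.items():
        params.setdefault(key, value)
    return params


stage_params = {"stage": "bwa", "genome": "path/to/alt/genome"}
environment = {"genome": "path/to/env/genome", "threads": 4}
merged = merge_environment(stage_params, environment)
```

The explicitly defined 'genome' survives, and only the missing 'threads' value is taken from the environment.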

At param lookup time:

  • when .self is encountered, a flag (self._self_mode) is set so that the next non-self lookup will be in 'self mode'.
  • __getattribute__ has been overridden to first check in the params.self namespace, and then (if not in self mode) to look in the broader namespace. It then resets the self-mode flag to false.
  • downside: lookups on the params object are not thread-safe, because the self-mode flag is shared state and concurrent lookups can collide. This shouldn't be a big issue.
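The flag mechanism can be illustrated with a small stand-alone class. This is a sketch of the idea only, not ZIPPY's params class: the class name Params, its constructor and the 'threads' parameter are hypothetical, and both namespaces are modelled as plain dictionaries.

```python
class Params:
    """Toy params object: stage-level values first, then global values."""

    def __init__(self, global_params, stage_params):
        self._global = global_params
        self._stage = stage_params
        self._self_mode = False

    def __getattribute__(self, name):
        # Internal and dunder attributes use the normal lookup machinery.
        if name.startswith("_"):
            return object.__getattribute__(self, name)
        if name == "self":
            # Arm self mode for the next lookup and keep the chain going.
            self._self_mode = True
            return self
        self_mode = self._self_mode
        self._self_mode = False  # the flag only ever affects one lookup
        if name in self._stage:
            return self._stage[name]
        if not self_mode and name in self._global:
            return self._global[name]
        raise AttributeError(name)


params = Params(
    global_params={"genome": "path/to/default/genome", "threads": 8},
    stage_params={"genome": "path/to/alt/genome"},
)
stage_genome = params.self.genome  # found in the stage namespace
global_threads = params.threads  # falls through to the broader namespace
```

In this sketch, params.self.threads raises AttributeError because 'threads' exists only in the broader namespace. The flag lives on the shared object, so two threads interleaving a '.self' access and an ordinary lookup could observe each other's flag state, which is the thread-safety caveat noted above.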

When json is saved:

  • environment variables and subworkflows will have been merged, and so a single file is written out.