Update Step Functions Example (#79)
Bug Fixes
* explicitly stop ecs before starting ebs autoscale on /var/lib/docker
* move nextflow additions to before the ecs agent is started
* add missing steps documented in https://docs.docker.com/storage/storagedriver/btrfs-driver/

Improvements
* add SSM agent and permissions to Batch hosts to enable SSM capabilities like Session Manager, facilitating troubleshooting via shell access without needing an EC2 key pair
* refactor containers and job defs
* use host bind mounted awscli
* use job def environment variables for execution options
* use common entrypoint script for all containers
* update sfn example to use dynamic parallelism
* remove unneeded parameters from job definitions
* update example workflow input
* update build dependencies
* explicitly add pip
* unpin cfn-lint so it stays up to date
* use common build script for tooling containers
* add container build template
* refactor step functions stack into separate templates
* create a generic workflow template that uses nested templates to build individual containers and the state machine for the workflow
* simplify the workflow definition templates; container builds and IAM role creation happen in parent templates
* add UpdateReplacePolicy for S3 Buckets

Documentation Updates
* update nextflow documentation
  * fix a couple of inconsistencies
  * improve flow and clarity
  * typo fixes
* update step functions docs
  * update images
  * add more details on job definition and sfn task
  * add more details on the example workflow
  * fix job output prefix in example input
  * update workflow completion time
  * add more detailed explanations of important job def parts and how they translate into sfn task code.
wleepang authored Dec 21, 2019
1 parent 3218203 commit 708e176
Showing 31 changed files with 1,318 additions and 769 deletions.
1 change: 1 addition & 0 deletions _scripts/test.sh
@@ -3,6 +3,7 @@
set -e

# check cfn templates for errors
cfn-lint --version
cfn-lint src/templates/**/*.template.yaml

# make sure that site can build
28 changes: 21 additions & 7 deletions docs/orchestration/nextflow/nextflow-overview.md
@@ -8,7 +8,8 @@ Nextflow can be run either locally or on a dedicated EC2 instance. The latter i

## Full Stack Deployment

The following CloudFormation template will launch an EC2 instance pre-configured for using Nextflow.
_For the impatient:_
The following CloudFormation template will create all the resources you need to run Nextflow using the architecture shown above. It combines the CloudFormation stacks referenced below in the [Requirements](#requirements) section.

| Name | Description | Source | Launch Stack |
| -- | -- | :--: | -- |
@@ -20,14 +21,24 @@ When the above stack is complete, you will have a preconfigured Batch Job Defini

To get started using Nextflow on AWS you'll need the following setup in your AWS account:

* The core set of resources (S3 Bucket, IAM Roles, AWS Batch) described in the [Getting Started](../../../core-env/introduction) section.
* A containerized `nextflow` executable that pulls configuration and workflow definitions from S3
* The core set of resources (S3 Bucket, IAM Roles, AWS Batch) described in the [Core Environment](../../../core-env/introduction) section.

If you are in a hurry, you can create the complete Core Environment using the following CloudFormation template:

| Name | Description | Source | Launch Stack |
| -- | -- | :--: | :--: |
{{ cfn_stack_row("GWFCore (Existing VPC)", "GWFCore-Full", "aws-genomics-root-novpc.template.yaml", "Create EC2 Launch Templates, AWS Batch Job Queues and Compute Environments, a secure Amazon S3 bucket, and IAM policies and roles within an **existing** VPC. _NOTE: You must provide a VPC ID and subnet IDs_.") }}

!!! note
The CloudFormation template above does **not** create a new VPC; instead, it creates the associated resources in an existing VPC of your choosing, or in your default VPC. To automate creating a new VPC to isolate your resources, you can use the [AWS VPC QuickStart](https://aws.amazon.com/quickstart/architecture/vpc/).

* A containerized `nextflow` executable with a custom entrypoint script that draws configuration information from AWS Batch supplied environment variables
* The AWS CLI installed in job instances using `conda`
* A Batch Job Definition that runs a Nextflow head node
* An IAM Role for the Nextflow head node job that allows it access to AWS Batch
* (optional) An S3 Bucket to store your Nextflow workflow definitions
* An IAM Role for the Nextflow head node job that allows it to submit AWS Batch jobs
* (optional) An S3 Bucket to store your Nextflow session cache

The last five items above are created by the following CloudFormation template:
The five items above are created by the following CloudFormation template:

| Name | Description | Source | Launch Stack |
| -- | -- | :--: | -- |
@@ -181,6 +192,9 @@ chown -R ec2-user:ec2-user $USER/miniconda
rm Miniconda3-latest-Linux-x86_64.sh
```

!!! note
The actual Launch Template used in the [Core Environment](../../core-env/introduction.md) does a couple more things, such as installing additional resources for [managing space for the job](../../core-env/create-custom-compute-resources.md).

### Batch job definition

An AWS Batch Job Definition for the containerized Nextflow described above is shown below.
@@ -374,7 +388,7 @@ You can customize these job definitions to incorporate additional environment va
!!! important
    Instances provisioned using the Nextflow-specific EC2 Launch Template configure `/var/lib/docker` in the host instance to use automatically [expandable scratch space](../../../core-env/create-custom-compute-resources/), allowing containerized jobs to stage as much data as needed without running into disk space limits.

### Running the workflow
### Running workflows

To run a workflow, you submit a `nextflow` Batch job to the appropriate Batch Job Queue.
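A minimal sketch of such a submission with the AWS CLI is shown below. The job definition name (`nextflow`), the queue name, and the workflow project passed as the container command are assumptions here; substitute the names created by your deployment.

```bash
# Sketch only: submit the Nextflow head-node job with the AWS CLI.
# The job definition, queue, and workflow project are placeholders --
# replace them with the values from your stack.
aws batch submit-job \
    --job-name nf-workflow-hello \
    --job-queue <your-job-queue-name> \
    --job-definition nextflow \
    --container-overrides command=nextflow-io/hello
```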

175 changes: 141 additions & 34 deletions docs/orchestration/step-functions/step-functions-overview.md
@@ -41,6 +41,7 @@ State machines that use AWS Batch for job execution and send events to CloudWatc
"Version": "2012-10-17",
"Statement": [
{
"Sid": "enable submitting batch jobs",
"Effect": "Allow",
"Action": [
"batch:SubmitJob",
@@ -64,9 +65,39 @@ State machines that use AWS Batch for job execution and send events to CloudWatc
}
```

For more complex workflows that nest other workflows or require more involved input parsing, you also need permissions to start Step Functions State Machine executions and to invoke Lambda functions:

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "enable calling lambda functions",
"Effect": "Allow",
"Action": [
"lambda:InvokeFunction"
],
"Resource": "*"
},
{
"Sid": "enable calling other step functions",
"Effect": "Allow",
"Action": [
"states:StartExecution"
],
"Resource": "*"
},
...
]
}
```

!!! note
    All `Resource` values in the policy statements above can be scoped more narrowly if needed.

## Step Functions State Machine

Workflows in AWS Step Functions are built using [Amazon States Language](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html) (ASL), a declarative, JSON-based, structured language used to define your state machine, a collection of states that can do work (Task states), determine which states to transition to next (Choice states), stop an execution with an error (Fail states), and so on.
Workflows in AWS Step Functions are built using [Amazon States Language](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html) (ASL), a declarative, JSON-based, structured language used to define a "state-machine". An AWS Step Functions State-Machine is a collection of states that can do work (Task states), determine which states to transition to next (Choice states), stop an execution with an error (Fail states), and so on.

### Building workflows with AWS Step Functions

@@ -123,9 +154,7 @@ Step Functions [ASL documentation](https://docs.aws.amazon.com/step-functions/la

### Batch Job Definitions

It is recommended to have [Batch Job Definitions](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) created for your tooling prior to building a Step Functions state machine. These can then be referenced in state machine `Task` states by their respective ARNs.

Step Functions will use the Batch Job Definition to define compute resource requirements and parameter defaults for the Batch Job it submits.
[AWS Batch Job Definitions](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html) are used to define compute resource requirements and parameter defaults for an AWS Batch Job. These are then referenced in state machine `Task` states by their respective ARNs.

An example Job Definition for the `bwa-mem` sequence aligner is shown below:

@@ -134,58 +163,85 @@ An example Job Definition for the `bwa-mem` sequence aligner is shown below:
"jobDefinitionName": "bwa-mem",
"type": "container",
"parameters": {
"InputReferenceS3Prefix": "s3://<bucket-name>/reference",
"InputFastqS3Path1": "s3://<bucket-name>/<sample-name>/fastq/read1.fastq.gz",
"InputFastqS3Path2": "s3://<bucket-name>/<sample-name>/fastq/read2.fastq.gz",
"OutputS3Prefix": "s3://<bucket-name>/<sample-name>/aligned"
"threads": "8"
},
"containerProperties": {
"image": "<dockerhub-user>/bwa-mem:latest",
"vcpus": 8,
"memory": 32000,
"command": [
"Ref::InputReferenceS3Prefix",
"Ref::InputFastqS3Path1",
"Ref::InputFastqS3Path2",
"Ref::OutputS3Prefix",
"bwa", "mem",
"-t", "Ref::threads",
"-p",
"reference.fasta",
"sample_1.fastq.gz"
],
"volumes": [
{
"host": {
"sourcePath": "/scratch"
},
"name": "scratch"
},
{
"host": {
"sourcePath": "/opt/miniconda"
},
"name": "aws-cli"
}
],
"environment": [
{
"name": "REFERENCE_URI",
"value": "s3://<bucket-name>/reference/*"
},
{
"name": "INPUT_DATA_URI",
"value": "s3://<bucket-name>/<sample-name>/fastq/*.fastq.gz"
},
{
"name": "OUTPUT_DATA_URI",
"value": "s3://<bucket-name>/<sample-name>/aligned"
}
],
"environment": [],
"mountPoints": [
{
"containerPath": "/opt/work",
"sourceVolume": "scratch"
},
{
"containerPath": "/opt/miniconda",
"sourceVolume": "aws-cli"
}
],
"ulimits": []
}
}
```

!!! note
The Job Definition above assumes that `bwa-mem` has been containerized with an
`entrypoint` script that handles Amazon S3 URIs for input and output data
staging.
There are three key parts of the above definition to take note of.

Because data staging requirements can be unique to the tooling used, neither AWS Batch nor Step Functions handles this automatically.
* Command and Parameters

The **command** is a list of strings that will be sent to the container. This is the same as the `...` arguments that you would provide to a `docker run mycontainer ...` command.

    **Parameters** are placeholders that you define whose values are substituted when a job is submitted. In the example above, a `threads` parameter is defined with a default value of `8`. The job definition's `command` references this parameter with `Ref::threads` (see the sketch after this list).

!!! note
        Parameter references in the command list must be separate strings; concatenation with other parameter references or static values is not allowed.

* Environment

**Environment** defines a set of environment variables that will be available for the container. For example, you can define environment variables used by the container entrypoint script to identify data it needs to stage in.

* Volumes and Mount Points

    Together, **volumes** and **mountPoints** define what you would otherwise provide with a `-v hostpath:containerpath` option to a `docker run` command. They can be used to map host directories containing resources (e.g. data or tools) shared by all containers. In the example above, a `scratch` volume is mapped so that the container can use a larger disk on the host. A version of the AWS CLI installed with `conda` is also mapped into the container, giving it access to the CLI (e.g. to transfer data to and from S3) without building it into the image (see the sketch below).

!!! note
The `volumes` and `mountPoints` specifications allow the job container to
use scratch storage space on the instance it is placed on. This is equivalent
to the `-v host_path:container_path` option provided to a `docker run` call
at the command line.
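
As a rough illustration of how these pieces fit together, the sketch below submits a job from the example definition, overriding the `threads` parameter and one environment variable, and shows an approximate `docker run` equivalent of the resulting container invocation. The queue name is a placeholder; the other values are taken from the example definition above.

```bash
# Submit a job from the "bwa-mem" job definition, overriding the default
# value of the "threads" parameter (referenced in the command as Ref::threads)
# and one of the environment variables used by the entrypoint script.
aws batch submit-job \
    --job-name bwa-mem-sample1 \
    --job-queue <your-job-queue-name> \
    --job-definition bwa-mem \
    --parameters threads=16 \
    --container-overrides 'environment=[{name=OUTPUT_DATA_URI,value=s3://<bucket-name>/<sample-name>/aligned}]'

# The container invocation that results is roughly equivalent to:
docker run --rm \
    -v /scratch:/opt/work \
    -v /opt/miniconda:/opt/miniconda \
    <dockerhub-user>/bwa-mem:latest \
    bwa mem -t 16 -p reference.fasta sample_1.fastq.gz
```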

### State Machine Batch Job Tasks

Conveniently for genomics workflows, AWS Step Functions has built-in integration with AWS Batch (and [several other services](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-connectors.html)), and provides snippets of code to make developing your state-machine
Batch tasks easier.
AWS Step Functions has built-in integration with AWS Batch (and [several other services](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-connectors.html)), and provides snippets of code to make developing your state-machine tasks easier.

![Manage a Batch Job Snippet](images/sfn-batch-job-snippet.png)

@@ -202,7 +258,15 @@ would look like the following:
"JobDefinition": "arn:aws:batch:<region>:<account>:job-definition/bwa-mem:1",
"JobName": "bwa-mem",
"JobQueue": "<queue-arn>",
"Parameters.$": "$.bwa-mem.parameters"
"Parameters.$": "$.bwa-mem.parameters",
"Environment": [
{"Name": "REFERENCE_URI",
"Value.$": "$.bwa-mem.environment.REFERENCE_URI"},
{"Name": "INPUT_DATA_URI",
"Value.$": "$.bwa-mem.environment.INPUT_DATA_URI"},
{"Name": "OUTPUT_DATA_URI",
"Value.$": "$.bwa-mem.environment.OUTPUT_DATA_URI"}
]
},
"Next": "NEXT_TASK_NAME"
}
@@ -214,36 +278,79 @@ Inputs to a state machine that uses the above `BwaMemTask` would look like this:
{
"bwa-mem": {
"parameters": {
"InputReferenceS3Prefix": "s3://<bucket-name/><sample-name>/reference",
"InputFastqS3Path1": "s3://<bucket-name/><sample-name>/fastq/read1.fastq.gz",
"InputFastqS3Path2": "s3://<bucket-name/><sample-name>/fastq/read2.fastq.gz",
"OutputS3Prefix": "s3://<bucket-name/><sample-name>/aligned"
"threads": 8
},
"environment": {
"REFERENCE_URI": "s3://<bucket-name/><sample-name>/reference/*",
"INPUT_DATA_URI": "s3://<bucket-name/><sample-name>/fastq/*.fastq.gz",
"OUTPUT_DATA_URI": "s3://<bucket-name/><sample-name>/aligned"
}
},
...
}
}
```

When the Task state completes, Step Functions adds information to a new `status` key under `bwa-mem` in the JSON object. The complete object is passed on to the next state in the workflow.

## Example state machine

All of the above is created by the following CloudFormation template.
The following CloudFormation template creates container images, AWS Batch Job Definitions, and an AWS Step Functions State Machine for a simple genomics workflow using bwa, samtools, and bcftools.

| Name | Description | Source | Launch Stack |
| -- | -- | :--: | :--: |
{{ cfn_stack_row("AWS Step Functions Example", "SfnExample", "step-functions/sfn-example.template.yaml", "Create a Step Functions State Machine, Batch Job Definitions, and container images to run an example genomics workflow") }}
{{ cfn_stack_row("AWS Step Functions Example", "SfnExample", "step-functions/sfn-workflow.template.yaml", "Create a Step Functions State Machine, Batch Job Definitions, and container images to run an example genomics workflow") }}

!!! note
The stack above needs to create several IAM Roles. You must have administrative privileges in your AWS Account for this to succeed.

The example workflow is a simple secondary analysis pipeline that converts raw FASTQ files into VCFs with variants called for a list of chromosomes. It uses the following open source tools:

* `bwa-mem`: Burrows-Wheeler Aligner for aligning short sequence reads to a reference genome
* `samtools`: **S**equence **A**lignment **M**apping library for indexing and sorting aligned reads
* `bcftools`: **B**inary variant **C**all **F**ormat library for determining variants in sample reads relative to a reference genome

Read alignment, sorting, and indexing are performed sequentially by Step Functions Task states. Variant calling for each chromosome occurs in parallel using a Step Functions Map state and the sub-Task states therein. All tasks submit AWS Batch Jobs to perform the computational work using containerized versions of the tools listed above.

![example genomics workflow state machine](./images/sfn-example-mapping-state-machine.png)

The tooling containers used by the workflow share a [generic entrypoint script]({{ repo_url + "tree/master/src/containers" }}) that wraps the underlying tool and handles S3 data staging. It uses the AWS CLI to transfer objects and uses environment variables to identify which data inputs and outputs to stage.
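
A minimal sketch of that pattern is shown below. This is an illustration only, not the exact script in the repository; it assumes the `REFERENCE_URI`, `INPUT_DATA_URI`, and `OUTPUT_DATA_URI` environment variables from the job definition example earlier and that the AWS CLI is available on the container's `PATH`.

```bash
#!/bin/bash
# Sketch of a generic AWS entrypoint: stage inputs in from S3, run the wrapped
# tool, then stage results back out to S3. The real script in the repository
# may differ in structure and error handling.
set -e

WORKDIR=/opt/work
mkdir -p "$WORKDIR" && cd "$WORKDIR"

# stage in the reference and input data; the URIs may contain wildcards
# (e.g. s3://bucket/sample/fastq/*.fastq.gz)
aws s3 cp --recursive --exclude "*" --include "$(basename "$REFERENCE_URI")" "$(dirname "$REFERENCE_URI")/" .
aws s3 cp --recursive --exclude "*" --include "$(basename "$INPUT_DATA_URI")" "$(dirname "$INPUT_DATA_URI")/" .

# run the wrapped tool with the arguments supplied as the Batch job command
"$@"

# stage results back out (a real script would copy only the outputs)
aws s3 cp --recursive . "$OUTPUT_DATA_URI/"
```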

### Running the workflow

When the stack above completes, go to the stack's Outputs tab and copy the JSON string provided in `StateMachineInput`.

![cloud formation output tab](./images/cfn-stack-outputs-tab.png)
![example state-machine input](./images/cfn-stack-outputs-statemachineinput.png)

The input JSON will look like the following, but with the values for `queue` and `JOB_OUTPUT_PREFIX` prepopulated with resource names specific to the stack created by the CloudFormation template above:

```json
{
"params": {
"__comment__": {
"replace values for `queue` and `environment.JOB_OUTPUT_PREFIX` with values that match your resources": {
"queue": "Name or ARN of the AWS Batch Job Queue the workflow will use by default.",
"environment.JOB_OUTPUT_PREFIX": "S3 URI (e.g. s3://bucket/prefix) you are using for workflow inputs and outputs."
}
},
"queue": "default",
"environment": {
"REFERENCE_NAME": "Homo_sapiens_assembly38",
"SAMPLE_ID": "NIST7035",
"SOURCE_DATA_PREFIX": "s3://aws-batch-genomics-shared/secondary-analysis/example-files/fastq",
"JOB_OUTPUT_PREFIX": "s3://YOUR-BUCKET-NAME/PREFIX",
"JOB_AWS_CLI_PATH": "/opt/miniconda/bin"
},
"chromosomes": [
"chr19",
"chr20",
"chr21",
"chr22"
]
}
}
```
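
If you prefer the command line over the console steps that follow, you can also start and monitor an execution with the AWS CLI. A sketch, assuming the input above has been saved to `input.json` with the placeholder values substituted:

```bash
# Start an execution of the example state machine (the ARN is a placeholder)
aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:<region>:<account>:stateMachine:<state-machine-name> \
    --input file://input.json

# Check on its progress using the execution ARN returned by the command above
aws stepfunctions describe-execution \
    --execution-arn <execution-arn>
```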

Next, head to the AWS Step Functions console and select the state-machine that was created.

![select state-machine](./images/sfn-console-statemachine.png)
@@ -260,4 +367,4 @@ You will then be taken to the execution tracking page where you can monitor the

![execution tracking](./images/sfn-console-execution-inprogress.png)

The workflow takes approximately 5-6hrs to complete on `r4.2xlarge` SPOT instances.
The example workflow references a small demo dataset and takes approximately 20-30 minutes to complete.
3 changes: 2 additions & 1 deletion environment.yaml
@@ -3,8 +3,9 @@ channels:
- defaults
dependencies:
- python=3.6.6
- pip
- pip:
- cfn-lint==0.16.0
- cfn-lint
- fontawesome-markdown==0.2.6
- mkdocs==1.0.4
- mkdocs-macros-plugin==0.2.4
6 changes: 6 additions & 0 deletions src/containers/_common/README.md
@@ -0,0 +1,6 @@
# Common assets for tooling containers

These are assets that are used to build all tooling containers.

* `build.sh`: a generic build script that first builds a base image for a container, then builds an AWS-specific image
* `entrypoint.aws.sh`: a generic entrypoint script that wraps a call to a tool binary in the container with handlers for staging data from/to S3
9 changes: 9 additions & 0 deletions src/containers/_common/build.sh
@@ -0,0 +1,9 @@
#!/bin/bash

IMAGE_NAME=$1

# build the base image
docker build -t $IMAGE_NAME .

# build the image with an AWS specific entrypoint
docker build -t $IMAGE_NAME -f aws.dockerfile .
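
A usage sketch for the script above, assuming it is invoked from a tool's container directory that provides both a base `Dockerfile` and an `aws.dockerfile` (the directory layout shown is an assumption):

```bash
# Build the base and AWS-specific images for a hypothetical bwa container
cd src/containers/bwa
../_common/build.sh bwa
```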