Skip to content

Commit

Permalink
Merge pull request #169 from icgc-argo/rc/3.0.0
Browse files Browse the repository at this point in the history
Rc/3.0.0
  • Loading branch information
andricDu authored Jul 19, 2021
2 parents 2020576 + 503f496 commit e8b86bf
Show file tree
Hide file tree
Showing 83 changed files with 2,714 additions and 1,517 deletions.
17 changes: 9 additions & 8 deletions Jenkinsfile
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@ spec:
- name: dind-daemon
image: docker:18.06-dind
securityContext:
privileged: true
privileged: true
runAsUser: 0
volumeMounts:
- name: docker-graph-storage
mountPath: /var/lib/docker
Expand All @@ -34,14 +35,14 @@ spec:
- name: docker
image: docker:18-git
tty: true
volumeMounts:
- mountPath: /var/run/docker.sock
name: docker-sock
env:
- name: DOCKER_HOST
value: tcp://localhost:2375
- name: HOME
value: /home/jenkins/agent
securityContext:
runAsUser: 1000
volumes:
- name: docker-sock
hostPath:
path: /var/run/docker.sock
type: File
- name: docker-graph-storage
emptyDir: {}
"""
Expand Down
159 changes: 158 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1 +1,158 @@
# workflow-management
# Workflow Management

[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)

This service is responsible for managing the lifecycle of a workflow execution within the larger Workflow Execution System (WES).
At it's simplest, Workflow Management takes a workflow request and initiates a run using the requested workflow engine (currently only supporting Nextflow).
However, it is possible to configure Workflow Management to go beyond being a direct initiator and instead it can be configured to queue runs
which can then be processed by various "middleware" before finally hitting the initiator, more on this below. In addition to queuing and initiating,
Workflow Management also handles cancelling runs and ensuring that run state transitions are valid (ie. a late received duplicate message to cancel a run
will be ignored as the run is already in a state of `CANCELLING` or `CANCELLED` at that point)

Workflow Management is part of the larger Workflow Execution Service which comprises multiple services. Given that Workflow Management is composed itself of
potentially multiple instances running different components (managed with Spring profiles) and potentially multiple middleware services, we refer to this more
broadly as the Workflow Management Domain.

![Workflow Management Domain](docs/wes-high-level-diagram.jpeg)

The above diagram illustrates the Workflow Management domain, and it's position in the overall WES service architecture. This README will attempt to illuminate
the main concepts employed by Workflow Management as well as summarize the technologies used and how to get up and running with it as smoothly as possible.

## Tech Stack
- Java 11
- RabbitMQ 3.X
- PostgreSQL 13
- [Spring Reactive Stack](https://docs.spring.io/spring-framework/docs/current/reference/html/web-reactive.html)
- [Reactor RabbitMQ Streams](https://pivotal.github.io/reactor-rabbitmq-streams/docs/current/)
- [Apache Avro](https://avro.apache.org/)

## Workflow Run Lifecycle and State Diagram

Workflow runs can be in a number of states as they make their way through the WES, these states can be updated by user-action (starting, cancelling),
by events in the Workflow Management domain (going from queued to initiating, cancelling to cancelled, etc), and from external events from the workflow
execution engine itself (Nextflow at the present time).

![WES States Diagram](docs/WES%20States%20and%20Transitions.png)

### High Level Flow (RabbitMQ Queues and How They Work)

Communication between the services in the management domain will be backed by RabbitMQ. This is because these messages don't need to be stored permanently,
we have selective message acknowledgement, once delivery, and clear transaction support. Messages in the RabbitMQ flow will also not pollute the journal in
the workflow-relay domain, ensuring separation of concerns.

![Management High Level Flow](docs/management-high-level-flow.jpeg)

In addition to the separation of concerns between decision-making (transitioning states) and journaling, this approach is especially beneficial to expanding
on the functionality of the Management domain with "middleware services", something that is explained in greater detail further down this readme.

To keep things orderly and sane, Management has a single message schema defined in using Apache AVRO, that it and all future middleware must use in order to
pass messages, a Lingua Franca of sorts.

##### Workflow Management Message Schema

```
WorkflowManagementMessage {
@NonNull String runId;
@NonNull WorkflowState state;
@NonNull String workflowUrl;
@NonNull String utcTime;
String workflowType;
String workflowTypeVersion;
Map<String, Object> workflowParams;
EngineParameters workflowEngineParams;
}
EngineParameters {
String defaultContainer;
String launchDir;
String projectDir;
String workDir;
String revision;
String resume;
String latest;
}
enum WorkflowState {
UNKNOWN("UNKNOWN"),
QUEUED("QUEUED"),
INITIALIZING("INITIALIZING"),
RUNNING("RUNNING"),
PAUSED("PAUSED"),
CANCELING("CANCELING"),
CANCELED("CANCELED"),
COMPLETE("COMPLETE"),
EXECUTOR_ERROR("EXECUTOR_ERROR"),
SYSTEM_ERROR("SYSTEM_ERROR");
}
```

#### Workflow State Transition Verifier

The "State Transition Verifier" component, which is backed by a cloud friendly storage (PSQL), is as a temporary ledger for workflow runs in management giving us a
way to lock/synchronize them. Following the state transition flow, workflows enter the ledger as `QUEUED` and leave the ledger once in one of the terminal states.
Only valid transitions in workflow state is allowed to occur so only those will be executed and/or sent to the relay-domain for journaling/indexing.

![State Transition Diagram](docs/state-transition-diagram.jpeg)

#### Workflow Management Run State

In order for the state verifier to work correctly, a centralized record of runs, and their last valid state, will be maintained in a database (PostgreSQL).
This state will be used as the ground truth when making any decisions regarding state transitions within Workflow Management and as the backing repository
for the Workflow Management API (TBD but this will be available to any "middleware" services as well).

## Middleware(ish) Service Support

Workflow Management approaches to concept of middleware a little differently than a more traditional application might. Let's first look at the default case of Management
running with no middleware whatsoever.

![Management - Standalone](docs/Managment%20Standalone.png)

In this case, there is no concept of queuing because there would be no difference between messages being set to `QUEUED` vs `INITIALIZING` as there is no action for any service to
take between those states. The diagram shows two distinct Management deploys but this is really only for getting the point across, this would likely be a single Management instance with
the right profiles enabled.

What if you did want to have something managing a queue, or processing templated requests, or sending notification on certain state transitions. We allow for this by enabling an
architecture where standing up services that speak to Management via it's API's is able to transition runs to any number of intermediary states `QUEUED` and `INITIALIZING`
(TBD, currently we only allow the base WES states however middleware would still be able to take any number of action before completing the transition).

![Management - Middleware(ish)](docs/Managment%20with%20Middleware.png)

In the diagram above you can see that the flow basically works as described. Management receives the request and queues the run immediately. A middleware service is listening on the `state.QUEUED` topic,
picks up the message for processing, and once complete sends a request to Management to initialize the run (with the updates state if applicable). Management then continues from this point just as if
it was running in standalone mode. In the future once the state transitions are configurable, any number of middleware can run in sequence or even in parallel as long as the ultimately get the message
back to management to initialize the run.

## Modes

TBD - Describe various run profiles (modes), what they do, etc.

## Build

With Maven:
```bash
mvn clean package
```

With Docker:
```bash
docker build .
```

## Run

The uber jar can be run as a spring boot application with the following command:
```bash
java -jar target/workflow-management.jar
```

Or with docker:
```bash
docker run ghcr.io/icgc-argo/workflow-management
```

## Test

```bash
mvn clean test
```
37 changes: 37 additions & 0 deletions docker-compose-gatekeeper.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
version: '3'

services:
gatekeeper-db:
image: docker.io/bitnami/postgresql:10-debian-10
environment:
POSTGRES_DB: gatekeeperdb
POSTGRES_PASSWORD: mysecretpassword
expose:
- "5432"
ports:
- "5432:5432"
rabbitmq:
image: bitnami/rabbitmq:3.8.9
environment:
RABBITMQ_USERNAME: user
RABBITMQ_PASSWORD: pass
ports:
- 5672:5672
- 15672:15672
zookeeper:
image: wurstmeister/zookeeper
ports:
- "2181:2181"
broker:
image: wurstmeister/kafka:2.13-2.6.0
hostname: broker
container_name: broker
depends_on:
- zookeeper
ports:
- "9092:9092"
environment:
KAFKA_ADVERTISED_HOST_NAME: localhost
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
volumes:
- /var/run/docker.sock:/var/run/docker.sock
Binary file added docs/Managment Standalone.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/Managment with Middleware.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/WES States and Transitions.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/management-high-level-flow.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/state-transition-diagram.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/wes-high-level-diagram.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
88 changes: 70 additions & 18 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,21 @@
</parent>
<groupId>org.icgc.argo</groupId>
<artifactId>workflow-management</artifactId>
<version>2.11.0</version>
<version>3.0.0</version>
<name>workflow-management</name>
<description>ARGO Workflow Management</description>

<repositories>
<repository>
<id>rabbitmq-maven-milestones</id>
<url>https://packagecloud.io/rabbitmq/maven-milestones/maven2</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<repository>
<id>springfox</id>
<name>springfox</name>
Expand All @@ -29,6 +39,7 @@
<properties>
<java.version>11</java.version>
<springfox.version>3.0.0-SNAPSHOT</springfox.version>
<reactor-rabbitmq-streams.version>0.0.7</reactor-rabbitmq-streams.version>
</properties>

<dependencies>
Expand Down Expand Up @@ -66,23 +77,6 @@
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<!-- SpringFox dependencies (SWAGGER) -->
<dependency>
<groupId>io.springfox</groupId>
<artifactId>springfox-swagger2</artifactId>
<version>${springfox.version}</version>
</dependency>
<dependency>
<groupId>io.springfox</groupId>
<artifactId>springfox-spring-webflux</artifactId>
<version>${springfox.version}</version>
</dependency>
<dependency>
<groupId>io.springfox</groupId>
<artifactId>springfox-swagger-ui</artifactId>
<version>${springfox.version}</version>
</dependency>

<dependency>
<groupId>io.fabric8</groupId>
<artifactId>kubernetes-client</artifactId>
Expand Down Expand Up @@ -136,10 +130,68 @@
<artifactId>federation-graphql-java-support</artifactId>
<version>0.5.0</version>
</dependency>

<!-- RabbitMQ -->
<dependency>
<groupId>com.pivotal.rabbitmq</groupId>
<artifactId>reactor-rabbitmq-streams-autoconfigure</artifactId>
</dependency>

<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>42.2.18</version>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>postgresql</artifactId>
<version>1.15.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-stream-kafka</artifactId>
<version>3.0.1.RELEASE</version>
</dependency>
</dependencies>

<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.pivotal.rabbitmq</groupId>
<artifactId>reactor-rabbitmq-streams</artifactId>
<version>${reactor-rabbitmq-streams.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>

<build>
<plugins>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.10.1</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<sourceDirectory>${project.basedir}/src/main/resources/avro/</sourceDirectory>
<outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
<!-- Set stringType to generate Java String instead of CharSequence -->
<stringType>String</stringType>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020 The Ontario Institute for Cancer Research. All rights reserved
* Copyright (c) 2021 The Ontario Institute for Cancer Research. All rights reserved
*
* This program and the accompanying materials are made available under the terms of the GNU Affero General Public License v3.0.
* You should have received a copy of the GNU Affero General Public License along with
Expand Down
Loading

0 comments on commit e8b86bf

Please sign in to comment.