docs: Update proposal 2 with design changes #143

Open · wants to merge 1 commit into base: `main`
177 changes: 102 additions & 75 deletions docs/proposals/proposal-002-run.md
@@ -10,6 +10,7 @@ This is step 2 from the automated pipeline to evaluate the carbon emissions of a
- @locomundo
- @nikimanoledaki
- @AntonioDiTuri
- @rossf7

## Status

@@ -82,8 +83,8 @@ know that this has succeeded?
-->

- Describe the actions to take immediately after the trigger and deployment of the CNCF project defined in [Proposal 1](./proposal-001-trigger-and-deploy.md)
- Describe how the pipeline should _fetch_ the benchmark jobs either from this repository (`cncf-tags/green-reviews-tooling`) or from an upstream repository (e.g. Falco's [`falcosecurity/cncf-green-review-testing`](https://github.com/falcosecurity/cncf-green-review-testing)).
- Describe how the pipeline should _run_ the benchmark workflows through GitHub Actions for a specific project e.g. Falco
- Describe how the pipeline should _fetch_ the benchmarks either from this repository (`cncf-tags/green-reviews-tooling`) or from an upstream repository (e.g. Falco's [`falcosecurity/cncf-green-review-testing`](https://github.com/falcosecurity/cncf-green-review-testing)).
- Describe how the pipeline should _run_ the benchmarks through GitHub Actions for a specific project e.g. Falco
- Communicate to CNCF projects interested in a Green Review the structure they need to comply with when creating a new benchmark job
- Provide _modularity_ for the benchmark tests.

@@ -164,108 +165,134 @@ change are understandable. This may include manifests or workflow examples
about HOW your proposal will be implemented, this is the place to discuss them.
-->

The Green Reviews automated pipeline relies on composing reusable GitHub Actions workflows to modularise the different moving parts. It is helpful to be familiar with the following documentation; a short illustrative sketch of the pattern follows the list:
- [GitHub Action workflows](https://docs.github.com/en/actions/using-workflows/about-workflows) - in summary, a workflow runs one or more jobs, and each job runs one or more actions.
- [Reusing workflows](https://docs.github.com/en/actions/using-workflows/reusing-workflows)
- [Calling reusable workflows](https://docs.github.com/en/actions/using-workflows/reusing-workflows#calling-a-reusable-workflow)
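
To make the pattern concrete, here is a minimal sketch of a reusable workflow and a workflow that calls it. The file names, the `version` input, and the `@main` reference are illustrative and not part of the proposal; the two files are shown together in one block for brevity.

```yaml
# File 1 (called workflow): .github/workflows/benchmark-example.yaml
# It declares `workflow_call` so other workflows can reuse it.
on:
  workflow_call:
    inputs:
      version:
        required: true
        type: string

jobs:
  print-version:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Benchmarking version ${{ inputs.version }}"

---
# File 2 (calling workflow): .github/workflows/pipeline-example.yaml
# Its job uses the `jobs.<job_id>.uses` syntax to call the reusable workflow above.
on:
  workflow_dispatch:

jobs:
  call-benchmark:
    uses: cncf-tags/green-reviews-tooling/.github/workflows/benchmark-example.yaml@main
    with:
      version: "0.37.0"
```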

### Definitions

There are different components defined here and shown in the following diagram.

![Green Reviews pipeline components](diagrams/green-reviews-pipeline-components.png "Green Reviews pipeline components")
```mermaid
---
title: Proposal 002 Run
---
stateDiagram-v2

getLatestReleases: GetLatestReleases()
projDispatch: DispatchProjects()
k8sCluster: Equinix K8s Cluster (k3s)

state "GH Workflow Falco" as falcoPipeline {
falcoInstallManifests: DeployFalco()
falcoDestroyManifests: UninstallFalco()
falcoStartBenchmarking: DeployBenchmarking()
falcoWaitBenchmarking: WaitBenchmarkingDuration()
falcoEndBenchmarking: StopBenchmarking()

falcoInstallManifests --> falcoStartBenchmarking: Start Synthetic Workload
falcoStartBenchmarking --> falcoWaitBenchmarking: Wait duration of benchmark
falcoWaitBenchmarking --> falcoEndBenchmarking: Destroy benchmarking resources
falcoEndBenchmarking --> falcoDestroyManifests: Uninstall Falco
}
state "GH Workflow Project [1:N]" as projNPipeline {
projNInstallManifests: DeployProject()
projNDestroyManifests: UninstallProject()
projNStartBenchmarking: DeployBenchmarking()
projNWaitBenchmarking: WaitBenchmarkingDuration()
projNEndBenchmarking: StopBenchmarking()

projNInstallManifests --> projNStartBenchmarking: Start Synthetic Workload
projNStartBenchmarking --> projNWaitBenchmarking: Wait duration of benchmark
projNWaitBenchmarking --> projNEndBenchmarking: Destroy benchmarking resources
projNEndBenchmarking --> projNDestroyManifests: Uninstall Project
}

state "(Github) CNCF Projects" as cncfProjs {
falco: falcosecurity/falco
project_[2]
project_[N]
}

[*] --> getLatestReleases: Trigger Cron @daily
getLatestReleases --> projDispatch: DetailOfProjects

getLatestReleases --> cncfProjs: GET /releases/latest
cncfProjs --> getLatestReleases: [{"tag"="x.y.z"},...]

projDispatch --> falcoPipeline: POST /workflows/dispatch
projDispatch --> projNPipeline: POST /workflows/dispatch


falcoPipeline --> k8sCluster
projNPipeline --> k8sCluster
%% k8sCluster --> falcoPipeline
%% k8sCluster --> projNPipeline
state join_state <<join>>
falcoPipeline --> join_state
projNPipeline --> join_state
```


Let's recap some of the components defined in [Proposal 1](proposal-001-trigger-and-deploy.md):
1. **Green Reviews pipeline**: the Continuous Integration pipeline which deploys a CNCF project to a test cluster, runs a set of benchmarks while measuring carbon emissions and stores the results. It is implemented by the workflows listed below.
2. **Cron workflow**: This refers to the initial GitHub Actions workflow (described in Proposal 1), which dispatches a project workflow (see next definition) as well as a delete workflow to clean up the resources created by the project workflow.
3. **Project workflow**: The project workflow is dispatched by the Cron workflow. A project workflow can be, for example, a Falco workflow. A project workflow deploys the project and calls the benchmark workflow (see below). A project workflow can be dispatched more than once if there are multiple project variants/setups. In addition, a Project workflow, which is also just another GitHub Action workflow, contains a list of GitHub Action Jobs.
3. **Project workflow**: The project workflow is dispatched by the Cron workflow. A project workflow can be, for example, a Falco workflow. A project workflow deploys the project and runs the benchmarks (see below). A project workflow can be dispatched more than once if there are multiple project variants/setups. In addition, a Project workflow, which is also just another GitHub Action workflow, contains a list of GitHub Action Jobs.
4. **Delete/cleanup workflow**: This workflow makes sure that the resources created by the project workflow are deleted so the environment returns to its initial state.

This proposal adds the following components:
5. **[new] Benchmark workflow**: A list of benchmark jobs that needs to be run in parallel. A benchmark workflow has a `1:many` relationship with benchmark jobs.
6. **[new] Benchmark job**: A list of benchmark instructions that are executed on the cluster. A benchmark job is an instance of a GitHub Action Job. Which benchmark test to run is defined by inputs in the calling workflow: a CNCF project and a variant.

### Calling the benchmark workflow

When the project workflow starts, it deploys the project on the test environment and then runs the test job. For modularity and/or clarity, the benchmark job could be defined in two different ways:

The benchmark job calls another GitHub Actions workflow (yes, yet another workflow 🙂) that contains the instructions. That workflow can be either:
1. Internal: In the Green Reviews WG repository (**preferred**)
2. External: In a separate repository, such as an upstream CNCF project repository
The two use cases for defining a benchmark workflow are illustrated below.

![Calling the benchmark job](diagrams/calling-benchmark-job.png "Calling the benchmark job")

This section defines _benchmark workflow_ and _benchmark job_. It describes how to run them from the _project workflow_. It dives deeper into the following:

* How a benchmark workflow should be called from the project workflow
* What a benchmark workflow must contain in order to run on the cluster

At a bare minimum, the benchmark workflow must contain a benchmark job with steps describing what should run in the Kubernetes cluster. For example, the Falco project maintainers have identified that one way to test the Falco project is through a test that runs `stress-ng` for a given period of time. The steps are contained in a Deployment manifest which can be directly applied to the community cluster using `kubectl`.

The benchmark workflows will be stored in the same JSON file as the other parameters for CNCF projects as defined in [Proposal 1](./proposal-001-trigger-and-deploy.md). It can be added as an additional input.

```yaml
# .github/workflows/benchmark-pipeline.yaml
jobs:
  # First, the workflow must authenticate to the Kubernetes cluster.
  # This is a benchmark job.
  benchmark-job:
    # The benchmark job calls the benchmark workflow.
    uses: ${{ inputs.benchmark_path }} # refers to the benchmark workflow path
```
5. **[new] Benchmark job**: a GitHub Actions job that applies the benchmark manifest using `kubectl apply -f`, waits the duration of the benchmark and deletes the manifest resources with `kubectl delete -f`.
6. **[new] Benchmark manifest**: A YAML file with the Kubernetes resources such as Deployments that deploy the benchmarking workload.

The manifest URL and benchmarking duration are configured via the [projects.json](../projects/projects.json).

```json
{
  "projects": [
    {
      "name": "falco",
      "organization": "falcosecurity",
      "benchmark": {
        "k8s_manifest_url": "https://raw.githubusercontent.com/falcosecurity/cncf-green-review-testing/e93136094735c1a52cbbef3d7e362839f26f4944/benchmark-tests/falco-benchmark-tests.yaml",
        "duration_mins": 15
      },
      "configs": [
        "ebpf",
        "modern-ebpf",
        "kmod"
      ]
    }
  ]
}
```
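
For illustration, the cron workflow could read this file with `jq` and dispatch one project workflow run per config via the GitHub CLI. This is only a sketch: it assumes a `falco.yaml` project workflow exists and accepts `config`, `benchmark_manifest_url`, and `duration_mins` inputs, none of which are fixed by this proposal.

```yaml
# Illustrative sketch of the dispatch step in the cron workflow (names are assumptions).
jobs:
  dispatch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Dispatch one project workflow run per config
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          manifest=$(jq -r '.projects[] | select(.name == "falco") | .benchmark.k8s_manifest_url' projects/projects.json)
          duration=$(jq -r '.projects[] | select(.name == "falco") | .benchmark.duration_mins' projects/projects.json)
          for config in $(jq -r '.projects[] | select(.name == "falco") | .configs[]' projects/projects.json); do
            # The project version from GetLatestReleases() would also be passed; omitted here.
            gh workflow run falco.yaml \
              -f config="$config" \
              -f benchmark_manifest_url="$manifest" \
              -f duration_mins="$duration"
          done
```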

This will fetch the workflow using the `jobs.<job_id>.uses` syntax defined [here](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_iduses).
### Benchmark job

Below are two use cases: the benchmark workflow may be defined in the Green Reviews repository or in a separate repository.
The benchmark job applies the manifest using kubectl. The functional unit test is time-bound in the case of Falco and scoped to 15 minutes. Therefore, we deploy this test, wait for 15 minutes, then delete the manifest to end the loop. The test steps depend on the functional unit of each CNCF project. The wait duration is configurable via the `duration_mins` field in `projects.json`.

#### Use Case 1: A GitHub Action job using a workflow defined in the _same_ repository (preferred)
The benchmark job is also responsible for deleting the manifests either after the wait duration or sooner if an error has occurred.
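
A minimal sketch of such a benchmark job is shown below. The input names are assumptions, and the job relies on `kubectl` already being authenticated against the test cluster (see the Authentication section).

```yaml
# Illustrative sketch of a benchmark job (input names are assumptions).
jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      MANIFEST_URL: ${{ inputs.benchmark_manifest_url }}
      DURATION_MINS: ${{ inputs.duration_mins }}
    steps:
      - name: Apply the benchmark manifest
        run: kubectl apply -f "$MANIFEST_URL"
      - name: Wait for the benchmark duration
        run: sleep "$(( DURATION_MINS * 60 ))"
      - name: Delete the benchmark resources
        if: always() # clean up even if an earlier step failed or the run was cancelled
        run: kubectl delete -f "$MANIFEST_URL"
```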

If the benchmark workflow is located in the Green Reviews repository, `benchmark_path` would refer to, for example, `cncf-tags/green-reviews-tooling/.github/workflows/falco-benchmark-workflow.yml@v1`.
### Benchmark manifest

In terms of the directory structure, in the `green-reviews-tooling` repository, we could create a subfolder such as `.github/workflows/benchmark-workflows` to contain the benchmark workflows.
At a bare minimum, the benchmark manifest must contain the Kubernetes resources for what should run in the Kubernetes cluster and specify which namespace should be used. For example, the Falco project maintainers have identified that one way to test the Falco project is through a test that runs `stress-ng` for a given period of time. The steps are contained in a Deployment manifest which is directly applied to the community cluster using `kubectl`.
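
As a rough illustration, such a manifest could look like the following. The namespace, resource names, image, and `stress-ng` arguments are placeholders rather than the agreed Falco benchmark.

```yaml
# Illustrative benchmark manifest (placeholder names and image).
apiVersion: v1
kind: Namespace
metadata:
  name: benchmark
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-ng-benchmark
  namespace: benchmark
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-ng-benchmark
  template:
    metadata:
      labels:
        app: stress-ng-benchmark
    spec:
      containers:
        - name: stress-ng
          # Placeholder image: substitute an image that ships the stress-ng binary.
          image: example.org/stress-ng:latest
          command: ["/bin/sh", "-c"]
          args:
            - while true; do stress-ng --cpu 2 --timeout 60s; sleep 5; done
```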

#### Use Case 2: A GitHub Action job using a workflow defined in a _different_ repository
Below are two use cases: the benchmark manifests may be defined in the Green Reviews repository or in a separate repository.

We want to accommodate different methods of setting up the tests depending on the CNCF project. Given this, the benchmark workflow containing the benchmark job could be defined in a different repository. In this case, the `benchmark_path` would be, for example, `falcosecurity/cncf-green-review-testing/.github/workflows/workflow.yml@v1`.
#### Use Case 1: Benchmark manifest is defined in the _same_ repository (preferred)

![Pipeline run](diagrams/pipeline-run.png "An example pipeline run")
Hosting the manifests in the Green Reviews repository is preferred for both simplicity and security. This is also preferred for generic benchmarks that can apply to multiple CNCF projects.

### Benchmark jobs
#### Use Case 2: Benchmark manifest is defined in a _different_ repository

The benchmark workflow which contains the benchmark jobs and their test steps may look like the following:
We want to accommodate different methods of setting up the tests depending on the CNCF project. Given this, the benchmark manifest could be defined in a different repository. In this case, the `k8s_manifest_url` would be, for example, `https://raw.githubusercontent.com/falcosecurity/cncf-green-review-testing/e93136094735c1a52cbbef3d7e362839f26f4944/benchmark-tests/falco-benchmark-tests.yaml`.

```yaml
# .github/workflows/tests/falco-benchmark-workflow.yaml
jobs:
  stress-ng-test:
    runs-on: ubuntu-latest
    steps:
      - run: |
          # The action to take here depends on the Functional Unit of the CNCF project:
          # apply the benchmark resources, then wait for the duration of the test.
          kubectl apply -f https://raw.githubusercontent.com/falcosecurity/cncf-green-review-testing/main/kustomize/falco-driver/ebpf/stress-ng.yaml
          # The above is a single benchmark job. If your workflow needs multiple benchmark
          # tests, it is enough to apply additional manifests, e.g.:
          #   redis-test: kubectl apply -f https://github.com/falcosecurity/cncf-green-review-testing/blob/main/kustomize/falco-driver/ebpf/redis.yaml
          #   event-generator-test: kubectl apply -f https://github.com/falcosecurity/cncf-green-review-testing/blob/main/kustomize/falco-driver/ebpf/falco-event-generator.yaml
          sleep 15m
      - if: always() # clean up the benchmark resources even if the previous step failed
        run: |
          kubectl delete -f https://raw.githubusercontent.com/falcosecurity/cncf-green-review-testing/main/kustomize/falco-driver/ebpf/stress-ng.yaml
```
Applying manifests from a different repository not controlled by Green Reviews is a potential security risk. See next section.

The benchmark job has some test instructions/steps. In this case, it applies an upstream Kubernetes manifest. This manifest contains a `while` loop that runs `stress-ng`. The manifest already defines where the test should run in the cluster i.e. in which namespace. The functional unit test is time-bound in this case and scoped to 15 minutes. Therefore, we deploy this test, wait for 15 minutes, then delete the manifest to end the loop. The test steps depend on the functional unit of each CNCF project.
### Versioning / Security

In the example above, the Kubernetes manifest that is applied to the cluster is located in a different repository: this is the case of an externally defined benchmark.

Each workflow should ensure that any artefacts deployed as part of the benchmark job are deleted at the end of the test run.
Manifests in `projects.json` are pinned to a Git commit SHA rather than a branch such as `main`. This mitigates the risk that a malicious workload could be included in the benchmark manifest and ensures that any changes to the manifests are reviewed by one of the Green Reviews maintainers.
**Review comment (Contributor Author):**

@leonardpahlke @nikimanoledaki @dipankardas011 FYI I added this note on pinning manifests to SHAs following our discussion in last week's meeting.

LMK if more detail is needed

**Review comment (Contributor):**

I think we should add some seccomp profiles and quota limits on the namespace we are dealing with, so that we don't impact other monitoring workloads. We should also perform some sort of image scanning before starting the benchmarks to safeguard against any potential problems.

Anyway, let me know your thoughts :)
I was just checking out the CIS benchmark for Kubernetes clusters.

**Review comment (Contributor Author):**

@dipankardas011 Those are good recommendations for securing the cluster but I think we should keep this focused on the pipeline.

The future of the cluster is uncertain. If we only provision on demand, the design is likely to be different from a permanent cluster, even a single-node one. So I would not invest time until the design is clear.

However, your approach of being guided by the CIS benchmark is sound. It's also a good point that by running both the project components and the benchmarks in the benchmark namespace we can restrict its access.


### Authentication

Before the benchmark workflow is called, we assume that the workflow already has access to a secret containing a kubeconfig to authenticate with the test cluster, and that Falco has already been deployed to it. The pipeline must authenticate with the Kubernetes cluster before running the benchmark job.
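
A minimal sketch of that authentication step, assuming the kubeconfig is stored base64-encoded in a repository secret (the secret name is illustrative):

```yaml
# Illustrative sketch: the KUBECONFIG_B64 secret name and base64 encoding are assumptions.
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Configure access to the test cluster
        run: |
          mkdir -p "$HOME/.kube"
          echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > "$HOME/.kube/config"
          chmod 600 "$HOME/.kube/config"
      - name: Check connectivity
        run: kubectl get nodes
```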

### Versioning

For versioning, this syntax can be configured to use `@main` or `@another-branch`, which is useful for versioning and for testing specific releases.
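
For example, the same workflow reference could be pinned to a tag, a branch, or a commit SHA (the path and versions below are illustrative):

```yaml
# Illustrative: pinning the called workflow to a tag, with a branch shown as an alternative.
jobs:
  benchmark-job:
    # Pin to a release tag for stability:
    uses: cncf-tags/green-reviews-tooling/.github/workflows/falco-benchmark-workflow.yml@v1
    # Or point at a branch while testing changes:
    # uses: cncf-tags/green-reviews-tooling/.github/workflows/falco-benchmark-workflow.yml@main
```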

## Drawbacks (Optional)

<!--
@@ -278,9 +305,9 @@ information to express the idea and why it was not acceptable.

Here is a list of the alternatives we considered:

- **mapping between workflows and CNCF projects**: we decided on a 1:1 relationship (every project will have only one workflow), again for simplicity. We could add support for 1:many in the future.
- **calling benchmarks as reusable GitHub Actions workflows**: this was originally selected, but calling workflows with the [uses](https://docs.github.com/en/actions/sharing-automations/reusing-workflows#calling-a-reusable-workflow) directive does not support a parameterised (dynamic) workflow path.

- **mapping between workflows and jobs**: we decided on a 1:many relationship (one workflow, many jobs); a different option we evaluated was a 1:1 relationship. We chose the first option because it is simpler and gives a clear overview of which jobs are needed for a project workflow.
- **mapping between benchmark manifests and CNCF projects**: we decided on a 1:1 relationship (every project will have only one benchmark manifest), again for simplicity. We could add support for 1:many in the future.

## Infrastructure Needed (Optional)
