Create self-hosted runner for integration(-ish) CI tests #75

safoinme · 2023-08-23T07:41:51Z

Introduction

This pull request (PR) addresses a long-standing challenge we've encountered with the K3d stack recipe. Specifically, our previous testing process on GitHub Actions fell short due to the resource-intensive nature of provisioning a K3d cluster and installing various applications.

To overcome this hurdle, we've introduced a solution leveraging GitHub's Self-hosted runners. These self-hosted runners grant us the flexibility to execute GitHub Actions workloads within our own custom environments, offering greater control and adaptability.

However, we are mindful of cost considerations and the environmental impact of maintaining VMs that run continuously. To address this, we've integrated Terraform into our workflow. With Terraform, we can dynamically provision VMs only when needed for testing purposes and efficiently de-provision them once testing is complete.

This PR represents a significant improvement in our testing infrastructure, allowing us to ensure the reliability and performance of the K3d stack recipe without incurring unnecessary costs or resource wastage. We look forward to your feedback and collaboration to further enhance our development process.

A detailed document about this can be found here

strickvl

Some comments, but two bigger things:

I don't see where the tests get run. Is this done manually only?
Why does all this run in separate workflows? Why can't the deployment and destruction + testing all happen in the same workflow?

infrastructure/terraform.tf

infrastructure/deploy.tf

.github/workflows/deploy-test-runner.yml

safoinme · 2023-08-24T08:49:47Z

@strickvl Regarding the questions:

What tests exactly? If we are talking about calling the provisioning and destruction of resources, they were not called because I didn't know exactly which tests we would want to run on the environment.
We can have them all in one workflow. However, the job that runs the tests must be changed to runs-on: self-hosted.

strickvl · 2023-08-24T09:39:28Z

> @strickvl Regarding the questions:
>
> What tests exactly? If we are talking about calling the provisioning and destruction of resources, they were not called because I didn't know exactly which tests we would want to run on the environment.

I'd suggest adding one example that shows how you think this should be used.

> We can have them all in one workflow. However, the job that runs the tests must be changed to runs-on: self-hosted.

Yeah, it just felt a bit weird to have them running in separate workflows.

Also, some follow-up questions:

What's the failure fallback here? What happens when something gets partially provisioned? What happens when a test fails and/or the destruction doesn't take place?
How does this work when two PRs are running these tests at the same time and the resource group already exists, but both are trying to create the same resources, potentially with the same names?
In general, I'm more interested in what happens, and how you envision this working, when things go wrong: either with GitHub Actions itself (like we have with QEMU at the moment) or when tests fail and we potentially have resources partially provisioned.

.github/workflows/deploy-test-runner.yml

.github/workflows/destroy-test-runner.yml

safoinme · 2023-09-04T13:20:02Z

@strickvl To address the questions:

If there is a problem while provisioning the VM, triggering a new run should fix it, unless some change is causing the failure. If the tests we want to run within the VM fail, the destroy will still be called and the Azure resources will be deleted.
That's a very good question and scenario (two PRs running at the same time), and one we may want to test, as I don't have a clear answer as to how it would behave. I think we can add a check for whether provisioning of the resources is already done and react based on it, but the problem is that whenever the main run that triggered provisioning is done, it will trigger the destroy.
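
As a rough illustration of the single-workflow layout discussed above, the sketch below chains provisioning, testing, and teardown as three jobs, with the teardown job marked `if: always()` so it runs even when an earlier job fails. The workflow name, job names, secret names, and the test command are assumptions made for the example and are not taken from this PR; it also assumes the Terraform state lives in a remote backend so that the destroy job can see what the provision job created.

```yaml
# Hypothetical single-workflow layout. Job names, secret names, and the
# test command are illustrative assumptions, not part of this PR.
name: k3d-integration-test

on:
  pull_request:

jobs:
  provision:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure
    env:
      # Service principal credentials read by the azurerm provider
      # (the secret names are assumed).
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_CLIENT_SECRET: ${{ secrets.ARM_CLIENT_SECRET }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve

  test:
    needs: provision
    # Runs on the VM that the provision job brought up.
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # Placeholder for whatever integration test is eventually chosen.
      - run: echo "run the K3d recipe test here"

  destroy:
    needs: [provision, test]
    # always() makes teardown run even if provisioning or the tests failed.
    if: always()
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure
    env:
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_CLIENT_SECRET: ${{ secrets.ARM_CLIENT_SECRET }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform destroy -auto-approve
```

This layout does not by itself answer the two-PRs question: two concurrent runs would still target the same Terraform state and resource names.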

.github/workflows/deploy-test-runner.yml

fa9r

I want to take a moment to celebrate this outstanding pull request description! The attention to detail, clarity, and thoroughness in explaining the changes made is truly commendable. The description not only helps the team understand the purpose and impact of the pull request but also showcases exceptional communication skills.

Code looks all good to me, but take that with a grain of salt since I'm not too familiar with TF yet. Overall I got what the code does, but I didn't fully understand how this would be used to run tests in practice. As @strickvl suggested, I think it would make sense to add a first integration test with this PR that showcases how the custom runners would actually be used for integration tests. In particular, I would be interested in when exactly the deployment and destruction will happen, is the idea that these are called right before/after each integration test?

fa9r · 2023-09-04T14:27:54Z

infrastructure/deploy.tf

+}
+
+data "azurerm_image" "example" {
+  name                = "mlstack-github-runner-machine-image-20230819162059"


Wouldn't the image change over time, or can this stay hardcoded?

I think this can become a variable that, for now, has this value as its default, so it stays configurable.
However, for the initial versions there will be a manual step, since GitHub didn't provide an API for self-hosted runners. The issue is that we need to configure the runner agent within the VM with a one-time token, which is why configuring the agent remains a manual step; we then build the VM image used here on top of that configured VM.

fa9r

Nice, LGTM.

Two things which I don't fully understand yet:

What does the Azure storage bucket do?
How would GH know where your self hosted runners are?

I guess these are just understanding questions from my side, so I'll approve already :)

fa9r · 2023-09-25T13:13:20Z

.github/workflows/ci.yml

  k3d_test:
    name: k3d_test
-    runs-on: ubuntu-latest
+    runs-on: self-hosted


Cool. I'm curious though, how would GitHub know where/how you have self-hosted it?

The Azure storage is used as a locking mechanism to avoid the scenario where several runs share the VM and one of them finishes before the others. Whenever a new run starts, it creates a file named after its run ID in the storage. Before destroying the VM, a run first deletes the file with its own run ID and then checks whether any other files are left. If none are left, the destroy is allowed; if even one file remains, some run is still in progress, so only the last run to finish is allowed to destroy the VM.
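
The lock described above could look roughly like the following pair of step lists, written here with the Azure CLI. This is only a sketch: the storage account and container names are made up, it assumes the job has already logged in to Azure with an identity allowed to read and write blobs, and it assumes `terraform init` has already run in the working directory. The PR's actual "blob write and check" steps may differ.

```yaml
# Hypothetical lock steps. LOCK_ACCOUNT and LOCK_CONTAINER are assumed names.
env:
  LOCK_ACCOUNT: examplerunnerlocks
  LOCK_CONTAINER: run-locks

# Steps for the start of a run: register this run ID as a lock file.
acquire_steps:
  - name: Write a lock blob named after this run
    run: |
      echo "$GITHUB_RUN_ID" > lockfile
      az storage blob upload \
        --account-name "$LOCK_ACCOUNT" \
        --container-name "$LOCK_CONTAINER" \
        --name "$GITHUB_RUN_ID" \
        --file lockfile \
        --auth-mode login

# Steps for the end of a run: release the lock, destroy only if last.
release_steps:
  - name: Delete this run's lock and destroy if no locks remain
    run: |
      az storage blob delete \
        --account-name "$LOCK_ACCOUNT" \
        --container-name "$LOCK_CONTAINER" \
        --name "$GITHUB_RUN_ID" \
        --auth-mode login
      # Count the lock blobs left behind by other runs.
      remaining=$(az storage blob list \
        --account-name "$LOCK_ACCOUNT" \
        --container-name "$LOCK_CONTAINER" \
        --auth-mode login \
        --query "length(@)" --output tsv)
      if [ "$remaining" -eq 0 ]; then
        terraform destroy -auto-approve
      else
        echo "$remaining run(s) still hold a lock; skipping destroy."
      fi
```

The keys `acquire_steps` and `release_steps` are not GitHub Actions syntax; they only group the two step lists, which would be pasted under the `steps:` of the relevant jobs. Note that the check and the destroy are not atomic, so a run that starts between them could still lose its VM.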

GH detects that because the VM is configured and connected to the GH Actions server, which has a heartbeat check that verifies the self-hosted runner is still running and launches the runs within that connected runner. (self-hosted is the default label for any such runner, so if there are several, the job runs on whichever is free; otherwise we can give each runner its own label.)
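
To make the label-based routing concrete, here is a small sketch of how a job could target a particular self-hosted runner. `self-hosted` and `linux` are default labels that GitHub assigns; `k3d` is a hypothetical custom label that would have to be added when the runner is registered.

```yaml
jobs:
  k3d_test:
    name: k3d_test
    # A job is sent only to a runner that carries every label listed here.
    # "k3d" is an assumed custom label, not something defined in this PR.
    runs-on: [self-hosted, linux, k3d]
    steps:
      - uses: actions/checkout@v4
      - run: k3d version
```

With plain `runs-on: self-hosted`, as in the diff above, any idle self-hosted runner available to the repository can pick up the job.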

strickvl · 2023-09-25T13:20:07Z

@safoinme the runner doesn't seem to run, however. Is something missing? I'm not sure what's going on.

safoinme · 2023-09-25T13:30:19Z

@strickvl Yes, I was looking into the reason this morning. It turns out that our token got invalidated because it hadn't been used for so long, so now we need to generate a new one. This is a big problem that, unfortunately, I don't think we have a solution for, because there is no API for token generation; if this happens, we need to generate a token manually and set it in the VM config.

strickvl · 2024-01-17T09:31:24Z

Now that we know how to do the self-hosted runners, should we close this branch? We have a ticket to implement integration tests which we can separately do. @safoinme WDYT?

safoinme · 2024-01-17T13:11:52Z

I agree, let's close this.

safoinme · 2024-01-17T13:12:29Z

Now we have self-hosted runners implemented with ARC on an organization level.

safoinme added 11 commits August 18, 2023 08:27

change recipe-test to test the runner

2630e05

change recipe-test to test the runner

fc4b378

add clone repo step

5a576e7

remove cloning step

60e48df

initial code for creating a self-hosted runner on an azure vm to test…

1ea2aa8

… heavy workloads

add destroy

2b4b2bc

fix repo url

71919d5

add terraform backend to store the state

329bb10

change image

28f6d64

return k3d-test to default runner

4c9efb8

Merge branch 'develop' into feature/create-self-hosted-runner

ad7b484

safoinme requested review from strickvl and wjayesh August 23, 2023 12:19

Merge branch 'develop' into feature/create-self-hosted-runner

862935d

strickvl changed the title ~~Feature/create self hosted runner~~ Create self-hosted runner for integration(-ish) CI tests Aug 23, 2023

strickvl added enhancement New feature or request internal tests labels Aug 23, 2023

strickvl reviewed Aug 23, 2023

infrastructure/terraform.tf

infrastructure/deploy.tf (outdated)

.github/workflows/deploy-test-runner.yml (outdated)

Merge branch 'develop' into feature/create-self-hosted-runner

35114c2

safoinme and others added 2 commits August 24, 2023 09:50

Update infrastructure/terraform.tf

f3377e6

Co-authored-by: Alex Strick van Linschoten <[email protected]>

apply suggested reviews

6abcbb1

safoinme requested a review from strickvl August 24, 2023 08:56

strickvl added 3 commits August 28, 2023 14:04

Merge branch 'develop' into feature/create-self-hosted-runner

2a6e8d0

Merge branch 'develop' into feature/create-self-hosted-runner

b0d112b

Merge branch 'develop' into feature/create-self-hosted-runner

923150d

strickvl reviewed Aug 30, 2023

.github/workflows/deploy-test-runner.yml (outdated)

strickvl reviewed Aug 30, 2023

.github/workflows/destroy-test-runner.yml (outdated)

Apply suggestions from code review

7cdb910

Co-authored-by: Alex Strick van Linschoten <[email protected]>

Merge branch 'develop' into feature/create-self-hosted-runner

9bbb119

strickvl requested review from fa9r and removed request for wjayesh September 4, 2023 13:21

fa9r reviewed Sep 4, 2023

.github/workflows/deploy-test-runner.yml (outdated)

fa9r reviewed Sep 4, 2023

safoinme and others added 9 commits September 24, 2023 15:07

Merge branch 'develop' into feature/create-self-hosted-runner

9ad516b

try new workflow to run on self-hosted runner

3184c72

Merge branch 'feature/create-self-hosted-runner' of github.com:zenml-…

56cfc4d

…io/mlops-stacks into feature/create-self-hosted-runner

format

e77ac3a

fix destory yml

fc46012

fix deploy yml

f88cf99

add tags to resource groups

7d30049

update blob write and check

83734ba

update blob write and check

0e8d71d

safoinme requested review from strickvl and fa9r September 25, 2023 10:14

fa9r approved these changes Sep 25, 2023

safoinme and others added 2 commits October 5, 2023 08:41

Merge branch 'develop' into feature/create-self-hosted-runner

8c512b8

Merge branch 'develop' into feature/create-self-hosted-runner

7ca13a3

safoinme closed this Jan 17, 2024
