Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create self-hosted runner for integration(-ish) CI tests #75

Closed
wants to merge 31 commits into from

Conversation

safoinme
Copy link
Contributor

@safoinme safoinme commented Aug 23, 2023

Introduction

This pull request (PR) addresses a long-standing challenge we've encountered with the K3d stack recipe. Specifically, our previous testing process on GitHub Actions fell short due to the resource-intensive nature of provisioning a K3d cluster and installing various applications.

To overcome this hurdle, we've introduced a solution leveraging GitHub's Self-hosted runners. These self-hosted runners grant us the flexibility to execute GitHub Actions workloads within our own custom environments, offering greater control and adaptability.

However, we are mindful of cost considerations and the environmental impact of maintaining VMs that run continuously. To address this, we've integrated Terraform into our workflow. With Terraform, we can dynamically provision VMs only when needed for testing purposes and efficiently de-provision them once testing is complete.

This PR represents a significant improvement in our testing infrastructure, allowing us to ensure the reliability and performance of the K3d stack recipe without incurring unnecessary costs or resource wastage. We look forward to your feedback and collaboration to further enhance our development process.

A full detailed document about this can be found here

@safoinme safoinme requested review from strickvl and wjayesh August 23, 2023 12:19
@strickvl strickvl changed the title Feature/create self hosted runner Create self-hosted runner for integration(-ish) CI tests Aug 23, 2023
@strickvl strickvl added enhancement New feature or request internal tests labels Aug 23, 2023
Copy link
Contributor

@strickvl strickvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, but two bigger things:

  • I don't see where the tests get run. Is this done manually only?
  • Why does all this run in separate workflows? Why can't the deployment and destruction + testing all happen in the same workflow?

infrastructure/terraform.tf Show resolved Hide resolved
infrastructure/deploy.tf Outdated Show resolved Hide resolved
.github/workflows/deploy-test-runner.yml Outdated Show resolved Hide resolved
@safoinme
Copy link
Contributor Author

safoinme commented Aug 24, 2023

@strickvl Regarding the questions:

  • What tests exactly? if we talking about calling the provisioning of and destruction of resources. They were not called because didn't know what tests we would want to run on the environment exactly.

  • We can have them all in one workflow, However, the job that will be running the test must be changed to runs-on: self-hosted

safoinme and others added 2 commits August 24, 2023 09:50
@safoinme safoinme requested a review from strickvl August 24, 2023 08:56
@strickvl
Copy link
Contributor

@strickvl Regarding the questions:

  • What tests exactly? if we talking about calling the provisioning of and destruction of resources. They were not called because didn't know what tests we would want to run on the environment exactly.

I'd suggest you add one way to indicate how you think this should be used.

  • We can have them all in one workflow, However, the job that will be running the test must be changed to runs-on: self-hosted

Yeah it just felt a bit weird to have them running in separate workflows.

Also followup questions:

  • what's the failure fallback here? what happens when something gets partially provisioned? what happens when a test fails and/or the destruction doesn't take place?
  • how does this work when two PRs are running these tests at the same time and the resource group already exists, but maybe they're both trying to create the same resources with potentially the same names?
  • in general, am more interested in what happens / how you envision this working when things go wrong (either with github actions etc like we have with qemu at the moment) or for when tests fail and potentially we have resources partially provisioned etc.

Co-authored-by: Alex Strick van Linschoten <[email protected]>
@safoinme
Copy link
Contributor Author

safoinme commented Sep 4, 2023

@strickvl To address the questions:

  • If there is some problem within the provisioning of the VM, triggering a new run should fix the problem unless there are some changes that are causing the failure, If the tests we want to run within the VM fail the destroy will still be called and the azure resources will be deleted.
  • That's a very good question and scenario (2PRs running at the same time) that we may want to test as I don't have a clear answer as to how would it behave, I think we can add a check if provisioning of resource is done or not and react based on it, but problem with this is that we never the main run that triggered provisioning is done it will trigger destroy

@strickvl strickvl requested review from fa9r and removed request for wjayesh September 4, 2023 13:21
Copy link
Contributor

@fa9r fa9r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to take a moment to celebrate this outstanding pull request description! The attention to detail, clarity, and thoroughness in explaining the changes made is truly commendable. The description not only helps the team understand the purpose and impact of the pull request but also showcases exceptional communication skills.

Code looks all good to me, but take that with a grain of salt since I'm not too familiar with TF yet. Overall I got what the code does, but I didn't fully understand how this would be used to run tests in practice. As @strickvl suggested, I think it would make sense to add a first integration test with this PR that showcases how the custom runners would actually be used for integration tests. In particular, I would be interested in when exactly the deployment and destruction will happen, is the idea that these are called right before/after each integration test?

}

data "azurerm_image" "example" {
name = "mlstack-github-runner-machine-image-20230819162059"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't the image change over time or can this stay hardcoded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can become a variable that for now has this as a default value, and it can be configurable for sure.
However for the initial versions, there will be a manual step since GitHub didn't provide an API for self-hosted runners, The issue is that we need to configure the runner agent within the VM with a one-time given token, that is why the configuration of the agent would still remain a manual step which we build a VM image on top of to use

@safoinme safoinme requested review from strickvl and fa9r September 25, 2023 10:14
Copy link
Contributor

@fa9r fa9r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM.

Two things which I don't fully understand yet:

  • What does the Azure storage bucket do?
  • How would GH know where your self hosted runners are?

I guess these are just understanding questions from my side, so I'll approve already :)

k3d_test:
name: k3d_test
runs-on: ubuntu-latest
runs-on: self-hosted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I'm curious though, how would GitHub know where/how you have self-hosted it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The Azure storage is used as a lockdown mechanism to avoid the scenario where we have multiple runs at the same time in the VM and one finishes before the others, what the lock mechanism does is create a file with the run ID within the storage whenever a new run is called and before destroying the VM it first deletes the file that has the same run id and then does a check if there are any other files left, if none is left it allows the destroy if there is even one file left it means some run is still in progress and only last run would be allowed to destroy the VM
  • GH detects that because the VM is configured and connected to the GH action server and it has a Heartbeats test that checks that the self-hosted runner is still running and lunch the runs within that connected runner (the self-hosted is the default name to any runner so if there is multiple it will run on free one otherwise we can give specific names to each runner)

@strickvl
Copy link
Contributor

@safoinme the runner doesn't seem to run, however. Something seems missing? or I'm not sure what's going on.

@safoinme
Copy link
Contributor Author

safoinme commented Sep 25, 2023

@strickvl Yes, I was looking for the reason this morning it turns out that our token got invalidated because it wasn't used for so long, now we need to generate a new one. This is a big problem that I don't think we have a potential solution for unfortunately because there is no API to token generation, so if this happened we need to generate it manually and set it in the VM config

@strickvl
Copy link
Contributor

Now that we know how to do the self-hosted runners, should we close this branch? We have a ticket to implement integration tests which we can separately do. @safoinme WDYT?

@safoinme
Copy link
Contributor Author

I agree let's close this

@safoinme
Copy link
Contributor Author

Now we have self-hosted runners implemented with ARC on an organization level.

@safoinme safoinme closed this Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request internal tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants