Create self-hosted runner for integration(-ish) CI tests #75
Conversation
Some comments, but two bigger things:
- I don't see where the tests get run. Is this done manually only?
- Why does all this run in separate workflows? Why can't the deployment and destruction + testing all happen in the same workflow?
@strickvl Regarding the questions:
Co-authored-by: Alex Strick van Linschoten <[email protected]>
I'd suggest you add one way to indicate how you think this should be used.
Yeah, it just felt a bit weird to have them running in separate workflows. Also, follow-up questions:
Co-authored-by: Alex Strick van Linschoten <[email protected]>
@strickvl To address the questions:
I want to take a moment to celebrate this outstanding pull request description! The attention to detail, clarity, and thoroughness in explaining the changes made is truly commendable. The description not only helps the team understand the purpose and impact of the pull request but also showcases exceptional communication skills.
Code looks all good to me, but take that with a grain of salt since I'm not too familiar with TF yet. Overall I got what the code does, but I didn't fully understand how this would be used to run tests in practice. As @strickvl suggested, I think it would make sense to add a first integration test with this PR that showcases how the custom runners would actually be used for integration tests. In particular, I would be interested in when exactly the deployment and destruction will happen: is the idea that these are called right before/after each integration test?
data "azurerm_image" "example" {
  name = "mlstack-github-runner-machine-image-20230819162059"
}
Wouldn't the image change over time, or can this stay hardcoded?
I think this can become a variable that, for now, has this image name as its default value, and it can definitely be made configurable.
However, for the initial versions there will still be a manual step, since GitHub doesn't provide an API for registering self-hosted runners. The issue is that we need to configure the runner agent inside the VM with a one-time token, which is why configuring the agent remains a manual step; we then build the VM image on top of that configured VM and reuse it.
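For illustration, here is a rough sketch (not this PR's code; the input name, variable name, and path are hypothetical) of how the image name could stay configurable from the workflow side while keeping the current image as the default:

# Hypothetical workflow_dispatch input: the current image remains the default,
# but a newly baked image can be supplied without editing the Terraform code.
on:
  workflow_dispatch:
    inputs:
      runner_image_name:
        description: "Custom VM image the runner VM is built from"
        required: false
        default: "mlstack-github-runner-machine-image-20230819162059"

jobs:
  provision:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Provision the runner VM from the selected image
        run: |
          terraform init
          terraform apply -auto-approve \
            -var="runner_image_name=${{ github.event.inputs.runner_image_name }}"
        working-directory: ./self-hosted-runner    # hypothetical path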
…io/mlops-stacks into feature/create-self-hosted-runner
Nice, LGTM.
Two things which I don't fully understand yet:
- What does the Azure storage bucket do?
- How would GH know where your self-hosted runners are?
I guess these are just understanding questions from my side, so I'll approve already :)
  k3d_test:
    name: k3d_test
-   runs-on: ubuntu-latest
+   runs-on: self-hosted
Cool. I'm curious, though: how would GitHub know where/how you have self-hosted it?
- The Azure storage is used as a locking mechanism to avoid the scenario where multiple runs share the VM at the same time and one finishes before the others. Whenever a new run starts, the lock mechanism creates a file named after the run ID in the storage. Before destroying the VM, it first deletes the file with that run ID and then checks whether any other files are left: if none remain, the destroy is allowed; if even one file is left, some run is still in progress, and only the last run is allowed to destroy the VM (see the sketch after this list).
- GitHub detects the runner because the VM is configured with the runner agent and connected to the GitHub Actions service, which uses heartbeat checks to verify that the self-hosted runner is still alive and launches runs on that connected runner. (`self-hosted` is the default label for any runner, so if there are multiple runners a job runs on whichever is free; we can also give each runner a specific name.)
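To make the locking idea concrete, here is a rough sketch of what the relevant workflow steps could look like. This is not the code from this PR: the container name, secret names, and job names are assumptions, and the `az` CLI is assumed to be available on the runner.

# Hypothetical jobs sketching the blob-per-run lock described above.
jobs:
  register-run:
    runs-on: ubuntu-latest
    steps:
      - name: Create a lock blob named after this run
        env:
          AZURE_STORAGE_ACCOUNT: ${{ secrets.AZURE_STORAGE_ACCOUNT }}
          AZURE_STORAGE_KEY: ${{ secrets.AZURE_STORAGE_KEY }}
        run: |
          echo "${{ github.run_id }}" > "${{ github.run_id }}.lock"
          az storage blob upload --container-name runner-locks \
            --name "${{ github.run_id }}.lock" --file "${{ github.run_id }}.lock"

  teardown:
    runs-on: ubuntu-latest
    needs: [k3d_test]    # hypothetical test job; runs after the tests
    if: always()         # even if the tests fail
    steps:
      - name: Remove this run's lock and destroy the VM only if no locks remain
        env:
          AZURE_STORAGE_ACCOUNT: ${{ secrets.AZURE_STORAGE_ACCOUNT }}
          AZURE_STORAGE_KEY: ${{ secrets.AZURE_STORAGE_KEY }}
        run: |
          az storage blob delete --container-name runner-locks \
            --name "${{ github.run_id }}.lock"
          remaining=$(az storage blob list --container-name runner-locks \
            --query "length(@)" -o tsv)
          # checkout / terraform init omitted for brevity in this sketch
          if [ "$remaining" -eq 0 ]; then
            terraform destroy -auto-approve
          else
            echo "Other runs still in progress; skipping destroy."
          fi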
@safoinme The runner doesn't seem to run, however. Something seems to be missing? I'm not sure what's going on.
@strickvl Yes, I was looking into the reason this morning; it turns out that our token got invalidated because it wasn't used for so long, so now we need to generate a new one. This is a big problem that I don't think we have a good solution for, unfortunately, because there is no API for token generation: if this happens, we need to generate the token manually and set it in the VM config.
Now that we know how to do the self-hosted runners, should we close this branch? We have a ticket to implement integration tests, which we can do separately. @safoinme WDYT?
I agree, let's close this.
We now have self-hosted runners implemented with ARC (Actions Runner Controller) at the organization level.
Introduction
This pull request (PR) addresses a long-standing challenge we've encountered with the K3d stack recipe. Specifically, our previous testing process on GitHub Actions fell short due to the resource-intensive nature of provisioning a K3d cluster and installing various applications.
To overcome this hurdle, we've introduced a solution leveraging GitHub's Self-hosted runners. These self-hosted runners grant us the flexibility to execute GitHub Actions workloads within our own custom environments, offering greater control and adaptability.
However, we are mindful of cost considerations and the environmental impact of maintaining VMs that run continuously. To address this, we've integrated Terraform into our workflow. With Terraform, we can dynamically provision VMs only when needed for testing purposes and efficiently de-provision them once testing is complete.
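As a sketch of how these pieces could fit together in a single workflow (the job names, paths, and test entrypoint below are hypothetical, not this PR's actual files): a provisioning job runs `terraform apply` to bring up the runner VM from the pre-baked image, the tests run on the self-hosted runner, and a final job runs `terraform destroy` even if the tests fail.

name: k3d-integration-tests    # hypothetical workflow name
on: workflow_dispatch

jobs:
  provision:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Bring up the self-hosted runner VM
        run: terraform init && terraform apply -auto-approve
        working-directory: ./self-hosted-runner    # hypothetical path

  k3d_test:
    needs: provision
    runs-on: self-hosted                           # executes on the freshly provisioned VM
    steps:
      - uses: actions/checkout@v3
      - name: Run the K3d stack recipe tests
        run: ./tests/run_k3d_tests.sh              # hypothetical entrypoint

  destroy:
    needs: k3d_test
    if: always()                                   # tear down even if the tests fail
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Destroy the runner VM
        run: terraform init && terraform destroy -auto-approve
        working-directory: ./self-hosted-runner    # hypothetical path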
This PR represents a significant improvement in our testing infrastructure, allowing us to ensure the reliability and performance of the K3d stack recipe without incurring unnecessary costs or resource wastage. We look forward to your feedback and collaboration to further enhance our development process.
A full, detailed document about this can be found here