ci: add automated and on demand testing of fluence #49
Problem: we cannot tell if/when fluence builds will break against upstream
Solution: have a weekly run that builds and tests the images, and deploys on successful results. For testing, I have added a complete example that uses a Job for both fluence and the default-scheduler, because a Job lets us run a container that generates output and completes, with no crash loop backoff or similar. I have added a complete testing setup using kind, in a single GitHub job, so we can build both containers, load them into kind, and then run the tests. Note that MiniKube does NOT appear to work for custom schedulers - I suspect there are extensions/plugins that need to be added. Finally, I was able to figure out how to programmatically check both the pod metadata for the scheduler and the events, and that combined with the output should be sufficient (for now) to test that fluence is working.
Signed-off-by: vsoch <[email protected]>
Excellent! Thank you for all your help with this infrastructure.
kubectl apply -f fluence-job.yaml
kubectl apply -f default-job.yaml
Note that scheduling pods with kube-scheduler and Fluence on the same cluster isn't supported. There isn't currently any way to propagate pod-to-node mappings generated by kube-scheduler to Fluence.
It's important that kubectl apply -f fluence-job.yaml is executed before kubectl apply -f default-job.yaml, and that they don't specify limits or requests so they could be scheduled on the same node. That's currently the case in this PR, but I'm emphasizing it for posterity.
Regardless, there still may be some funky race condition that occurs and results in unschedulable pods.
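The "no limits or requests" condition above could be checked before applying the jobs. This is only a rough sketch in Python, not something from this PR: it assumes the Job manifests have already been parsed from YAML into dictionaries, and the function name, container names, and images are hypothetical.

```python
from typing import List


def containers_with_resources(job: dict) -> List[str]:
    """Return the names of containers in a Job manifest that set requests or limits."""
    pod_spec = job.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    flagged = []
    for container in containers:
        resources = container.get("resources") or {}
        if resources.get("requests") or resources.get("limits"):
            flagged.append(container.get("name", "<unnamed>"))
    return flagged


# Hypothetical manifests, as they might look after parsing YAML into dicts.
job_without_resources = {
    "kind": "Job",
    "spec": {"template": {"spec": {
        "schedulerName": "fluence",
        "containers": [{"name": "fluence-job", "image": "busybox"}],
    }}},
}
job_with_requests = {
    "kind": "Job",
    "spec": {"template": {"spec": {
        "containers": [{"name": "app", "image": "busybox",
                        "resources": {"requests": {"cpu": "1"}}}],
    }}},
}

print(containers_with_resources(job_without_resources))  # []
print(containers_with_resources(job_with_requests))      # ['app']
```

An empty list for both test jobs would mean the condition described above holds for them.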
It's important that kubectl apply -f fluence-job.yaml is executed before kubectl apply -f default-job.yaml, and that they don't specify limits or requests so they could be scheduled on the same node. That's currently the case in this PR, but I'm emphasizing it for posterity.
Gotcha - I think for the testing cluster (and the example we already had in main) we are doing just that, putting them on the same node, and since it's a tiny kind or otherwise local cluster, there hasn't been an issue. If we extended this to an actual setup, there would be. This is an important point and I've opened an issue for emphasizing it in future docs: #53. Maybe we can think of a creative way to allow for both, possibly with kueue resource flavors that create distinct (separate) resources that are labeled for each.
This will be extremely helpful to reduce drift from upstream.
Huge agree! It will be much easier to fix tiny issues that pop up along the way, and I volunteer to take charge of monitoring that (and opening PRs with any fixes that are needed). This setup is also useful for making sure the containers (fluence and sidecar) we are deploying (at the same frequency) are provided with the latest build (that works) combined with kubernetes-sigs/scheduler-plugins.
Problem: we cannot tell if/when fluence builds will break against upstream
Solution: have a weekly run that builds and tests the images, and deploys on successful results. For testing, I have added a complete example that uses a Job for both fluence and the default-scheduler, because a Job lets us run a container that generates output and completes, with no crash loop backoff or similar. I have added a complete testing setup using kind, in a single GitHub job, so we can build both containers, load them into kind, and then run the tests. Note that MiniKube does NOT appear to work for custom schedulers - I suspect there are extensions/plugins that need to be added. Finally, I was able to figure out how to programmatically check both the pod metadata for the scheduler and the events, and that combined with the output should be sufficient (for now) to test that fluence is working.
Summary
In summary this PR:
Interesting Things I Learned
kubectl events (see below)
I found these commands useful for checking scheduler assignment. The first is the schedulerName (generated from the job).
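The original command is not preserved in this text. As a minimal sketch of the same check in Python, assume the pod list was saved with something like kubectl get pods -o json; the function name and the pod names in the sample are hypothetical.

```python
import json
from typing import Dict


def scheduler_names(pod_list: dict) -> Dict[str, str]:
    """Map each pod name to the schedulerName recorded in its spec."""
    names = {}
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "")
        names[pod_name] = pod.get("spec", {}).get("schedulerName", "")
    return names


# Hypothetical, heavily trimmed pod list in the shape of `kubectl get pods -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "fluence-job-aaaaa"}, "spec": {"schedulerName": "fluence"}},
  {"metadata": {"name": "default-job-bbbbb"}, "spec": {"schedulerName": "default-scheduler"}}
]}
""")

print(scheduler_names(sample))
```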
That worked for both. But it might be the case that the schedulerName we provide is not actually the one assigned (or maybe it doesn't run if it can't be satisfied, I'm not sure). Either way, it makes sense to check via the event. Getting the event was trickier - in both cases I was interested in the "Reason" -> "Scheduled." For fluence, I found the name under .reportingComponent, and for the default-scheduler that field was blank, and I found it under .source.component. For those interested, here are two events to compare.
Default Scheduler "Scheduled" Event
And for fluence we actually see that source is empty (the opposite)
Fluence "Scheduled" Event
I thought that was interesting - it seems to be by design that the default-scheduler is not considered an extra component (and fluence is), and that fluence is not considered a core kubernetes source. I have no idea why, and I'll probably Google around / ask people about that subtle difference. So here is the jq fu (jq is the best tool!) to get the exact output for each:
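The jq expressions themselves are not preserved in this text. The same lookup can be sketched in Python, assuming the events were saved with something like kubectl get events -o json. The fallback from .reportingComponent to .source.component follows the difference described above; the function name and the trimmed sample events are hypothetical.

```python
def scheduled_by(event_list: dict) -> dict:
    """Map pod name -> component that reported its "Scheduled" event.

    Checks .reportingComponent first (where fluence appeared) and falls
    back to .source.component (where default-scheduler appeared).
    """
    result = {}
    for event in event_list.get("items", []):
        if event.get("reason") != "Scheduled":
            continue
        component = (
            event.get("reportingComponent")
            or (event.get("source") or {}).get("component")
            or ""
        )
        pod_name = event.get("involvedObject", {}).get("name", "")
        result[pod_name] = component
    return result


# Hypothetical, heavily trimmed events in the shape of `kubectl get events -o json`.
sample_events = {"items": [
    {"reason": "Scheduled", "involvedObject": {"name": "fluence-job-aaaaa"},
     "reportingComponent": "fluence", "source": {}},
    {"reason": "Scheduled", "involvedObject": {"name": "default-job-bbbbb"},
     "reportingComponent": "", "source": {"component": "default-scheduler"}},
    {"reason": "Pulled", "involvedObject": {"name": "default-job-bbbbb"},
     "reportingComponent": "", "source": {"component": "kubelet"}},
]}

print(scheduled_by(sample_events))
```

Checking both fields in one function means the same test can be applied to a fluence pod and a default-scheduler pod without special-casing either.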
This might take a few iterations to get working in CI (I haven't used this setup kind action before) and I can ping folks when it is done.
Ok, everything is set. Ping @cmisale and @milroy for review, and of course no rush, it's ready when we need it!