[MM-51801] Cloud native support #34
Conversation
- JobService interface
- Bump build deps
- [MM-52323] Refactor docker implementation (#24)
  - Refactor docker job service
  - Add tests
- Setup k8s client
- Implement k8s API
- Remove StopJob
- Implement Init() API call
- Setup k8s CI
- Update sample config
- Add local k8s development doc
- Use human friendly prefix for job names
- Add support for passing custom tolerations
docs/kubernetes_development.md (outdated)

```sh
kubectl logs -l app=calls
```

### Check pod IP address
@streamer45 Any objection to adding a k8s service in front of the calls-offloader app and using it instead of the IP? IPs are subject to change, but the svc would serve as a proper abstraction.
@stylianosrigas We discussed this a bit in the design doc when considering load balancing options (see comment for context).
The way this service was originally designed wouldn't allow for a k8s service if running more than one pod, because this is not really a stateless application and each calls-offloader instance is independent (no global DB or store, same as rtcd). Specifically, we used to track job IDs in order to stop recordings or get the status of a recording job.
Eventually I modified the approach a bit so that stopping a recording doesn't require us to hit the API anymore. So in theory we could use a service with multiple pods behind it if we are happy to lose some visibility (e.g. fetching job status and logs) from the MM side.
> no global DB or store, same as rtcd). Specifically, we used to track job IDs in order to stop recordings or get the s…
Unfortunately, I don't see another way here, as there is no way to guarantee the specific IP that Mattermost needs. The svc is the only thing we can rely on to always hit the application. Fetching logs should not be a big problem, as we can still see logs via our logging tools. I don't know how big a problem not getting job status in Mattermost will be, though.
Looking at the docs, I think it could be achieved through a StatefulSet + headless service to get a consistent DNS name pointing to the pod. That said, I don't know how much we'd like that approach.
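For reference, a StatefulSet combined with a headless service gives each pod a stable DNS name of the form `<pod>.<service>.<namespace>.svc.cluster.local`, where the pod name is the StatefulSet name plus an ordinal. A minimal sketch of that naming convention; the `calls-offloader` and `mattermost` values are hypothetical placeholders, not values from this PR:

```go
package main

import "fmt"

// stablePodDNS builds the per-pod DNS name provided by a StatefulSet behind a
// headless Service: <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local
func stablePodDNS(statefulSet string, ordinal int, service, namespace string) string {
	return fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local", statefulSet, ordinal, service, namespace)
}

func main() {
	// With a single replica, ordinal 0 gives a consistent address for the one pod.
	fmt.Println(stablePodDNS("calls-offloader", 0, "calls-offloader", "mattermost"))
}
```

With a single replica, Mattermost could be pointed at the ordinal-0 name and keep working across pod restarts, which is what makes this an alternative to tracking pod IPs.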
From the MM/Calls perspective, after the changes we made, I believe there's no strong functionality requirement to access logs/job statuses other than to aid debugging. But again, in our case we have other monitoring tools in place so we could probably get away with not having those endpoints.
@stylianosrigas If we went with the service, what type would you recommend? Would we need an ingress as well?
@stylianosrigas Right, traffic would be internal. Would a simple NodePort service work for this case, or would you recommend a LoadBalancer?
> Does this somehow affect scaling of the service?
Not really as we implemented some logic on the client so that if an API call fails due to authentication, it will automatically attempt to register and perform the call again.
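The re-register-and-retry behavior described above could look roughly like this sketch; the `client`, `register`, and `call` names are hypothetical stand-ins, not the actual calls-offloader client API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnauthorized stands in for the authentication failure the real client
// would detect (hypothetical, for illustration only).
var ErrUnauthorized = errors.New("unauthorized")

type client struct {
	registered bool
}

func (c *client) register() error {
	c.registered = true
	return nil
}

func (c *client) call() error {
	if !c.registered {
		return ErrUnauthorized
	}
	return nil
}

// doCall performs an API call and, if it fails due to authentication,
// automatically registers and attempts the call again.
func (c *client) doCall() error {
	err := c.call()
	if errors.Is(err, ErrUnauthorized) {
		if err := c.register(); err != nil {
			return err
		}
		return c.call()
	}
	return err
}

func main() {
	c := &client{}
	fmt.Println(c.doCall())
}
```

This is why hitting a different pod behind a service is not fatal: a pod that doesn't recognize the client simply triggers a transparent re-registration.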
The only detail I'd like to get some clarity on is how the recorder container image will be loaded. In the existing docker setup we fetch it the first time we receive a connection from the plugin side. Not sure if/how this is doable in k8s to be honest. Loading the image at time of need is a bit inconvenient in my mind as it could delay the start of a recording by several seconds (or more). See #26 (comment) for more context.
A NodePort service should be fine. We don't need a LoadBalancer, as no external cluster traffic is required here.

> The only detail I'd like to get some clarity on is how the recorder container image will be loaded
You mean the job running container image? This should be loaded when the job pod starts running.
> NodePort service should be fine

Sounds good, thanks.

> You mean the job running container image? This should be loaded when the job pod starts running.
Yes, that was my understanding as well. The only problem is that it's not a particularly lean image (~500MB compressed, see here), so the concern on my side is that it could delay starting a recording by several seconds (hopefully not minutes). Of course, I expect the image to get cached, so it should only affect the very first attempt.
We will set the image pull policy (https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) to IfNotPresent in the job, so it only pulls the image the very first time it runs.
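In `k8s.io/api` terms, that pull policy lives on the job's container spec. A configuration sketch assuming those types; the container name and image argument are placeholders, not the PR's actual values:

```go
package jobs // hypothetical package name

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// recorderJobSpec sketches a JobSpec whose pod pulls the recorder image only
// when it is not already cached on the node (IfNotPresent).
func recorderJobSpec(image string) batchv1.JobSpec {
	return batchv1.JobSpec{
		Template: corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:            "recorder", // placeholder container name
					Image:           image,
					ImagePullPolicy: corev1.PullIfNotPresent,
				}},
			},
		},
	}
}
```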
Alright, let's try it this way. If it becomes a problem, I suppose we could devise some workaround, such as a dry-run job that does nothing but force fetching the image.
```go
Spec: batchv1.JobSpec{
	// We only support one recording job at a time and don't want it to
	// restart on failure.
	Parallelism: newInt32(1),
```
@streamer45 We should also set ttlSecondsAfterFinished to something like 48h (the field is in seconds) so that finished jobs do not need manual deletion. This will help us avoid ending up with hundreds of completed jobs.
Thanks. The way it works now is that we automatically delete successful jobs but retain failed ones, because deleting a failed job could mean data loss (e.g. the recording was not uploaded). This way we give administrators a chance to recover files if necessary.
That said, we could set TTLSecondsAfterFinished as you suggest, but if we do I'd probably make it configurable, as it becomes a data retention setting in practice.
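For context, the field in question is `TTLSecondsAfterFinished` on `batchv1.JobSpec` (a `*int32`, expressed in seconds). A hypothetical sketch of what making it configurable might look like; the function name is illustrative, not from this PR:

```go
package jobs // hypothetical package name

import batchv1 "k8s.io/api/batch/v1"

// withRetention sketches a configurable retention period. The 48h suggested
// above would be passed as 48*60*60 = 172800, since the field is in seconds.
func withRetention(spec batchv1.JobSpec, ttlSeconds int32) batchv1.JobSpec {
	spec.TTLSecondsAfterFinished = &ttlSeconds
	return spec
}
```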
If you don't mind, given this will only affect failed jobs, I am considering deferring this to https://mattermost.atlassian.net/browse/MM-49202, mostly to avoid exposing an implementation-specific setting that only works for k8s.
Overall looks great, two main points from my side ;)
Thank you for the careful work @streamer45! 🎉
@stylianosrigas I think we should be able to use mattermost/calls-offloader-daily:dev-526c668 for testing, mostly to keep the PR open for the security review. Let me know.
@streamer45 Is this based on the latest PR code?
Summary
This PR adds cloud native support to the calls-offloader service. This mostly boils down to providing a Kubernetes API implementation based on Jobs as a more cloud-oriented alternative to the existing one based on simple Docker containers. Some previous context can also be found in #26.
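At its core, a Jobs-based implementation submits a `batchv1.Job` through the typed client-go clientset. A minimal sketch under that assumption; the function and parameter names are illustrative, not the PR's actual code:

```go
package jobs // hypothetical package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createJob submits a Job to the given namespace and returns the created
// object, whose status can later be polled or watched.
func createJob(ctx context.Context, cs kubernetes.Interface, namespace string, job *batchv1.Job) (*batchv1.Job, error) {
	return cs.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{})
}
```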
Related PRs
Calls plugin: mattermost/mattermost-plugin-calls#409
Calls recorder: mattermost/calls-recorder#33
Design doc
https://docs.google.com/document/d/1dCik_WdamxBg_UaPYL_k04EAPFP1-rkINk-GTuIBjYc/edit
Ticket Link
https://mattermost.atlassian.net/browse/MM-51801