
[MM-51801] Cloud native support #34

Merged: 14 commits merged into master from MM-51801-cloud-native on Jul 28, 2023
Conversation

streamer45 (Contributor)

Summary

This PR adds cloud native support to the calls-offloader service. It mostly boils down to providing a Kubernetes API implementation based on Jobs as a more cloud-oriented alternative to the existing one based on simple Docker containers.

Some previous context can also be found in #26.

Related PRs

Calls plugin: mattermost/mattermost-plugin-calls#409
Calls recorder: mattermost/calls-recorder#33

Design doc

https://docs.google.com/document/d/1dCik_WdamxBg_UaPYL_k04EAPFP1-rkINk-GTuIBjYc/edit

Ticket Link

https://mattermost.atlassian.net/browse/MM-51801
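For readers less familiar with the Jobs-based approach described in the summary, here is a minimal, hypothetical sketch of what creating a recording job through client-go can look like. It is not the PR's actual implementation: the namespace, image tag, container name, and field values are illustrative assumptions.

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newInt32(n int32) *int32 { return &n }

// createRecordingJob is an illustrative sketch (not the actual implementation)
// of spawning a recording job through the Kubernetes Jobs API.
func createRecordingJob(ctx context.Context, cs kubernetes.Interface) (*batchv1.Job, error) {
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			// Human-friendly prefix; the API server appends a unique suffix.
			GenerateName: "calls-recorder-job-",
		},
		Spec: batchv1.JobSpec{
			// One recording job at a time, no retries on failure.
			Parallelism:  newInt32(1),
			Completions:  newInt32(1),
			BackoffLimit: newInt32(0),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:            "calls-recorder",
						Image:           "mattermost/calls-recorder:latest", // assumed tag
						ImagePullPolicy: corev1.PullIfNotPresent,
					}},
				},
			},
		},
	}
	return cs.BatchV1().Jobs("default").Create(ctx, job, metav1.CreateOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if _, err := createRecordingJob(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
		panic(err)
	}
}
```

Using GenerateName gives each job a readable, unique name, which matches the "Use human friendly prefix for job names" commit in the list below.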

* JobService interface

* Bump build deps

* [MM-52323] Refactor docker implementation (#24)

* Refactor docker job service

* Add tests
* Setup k8s client

* Implement k8s API

* Remove StopJob

* Implement Init() API call

* Setup k8s CI

* Update sample config

* Add local k8s development doc

* Use human friendly prefix for job names

* Add support for passing custom tolerations
@streamer45 added the 2: Dev Review, 3: Security Review, and Do Not Merge/Awaiting PR labels on Jun 26, 2023
@streamer45 added this to the v0.3.0 milestone on Jun 26, 2023
@streamer45 self-assigned this on Jun 26, 2023
```
kubectl logs -l app=calls
```

### Check pod IP address

@stylianosrigas

@streamer45 Any objection to adding a k8s service in front of the calls-offloader app and using that instead of the IP? IPs are subject to change, but a svc would serve properly as an abstraction.

streamer45 (Contributor Author)

@stylianosrigas We discussed this a bit in the design doc when considering load balancing options (see comment for context).

The way this service was originally designed wouldn't allow for a k8s service if running more than one pod because this is not really a stateless application and each calls-offloader instance is independent (no global DB or store, same as rtcd). Specifically, we used to track job IDs in order to stop recordings or get the status of a recording job.

Eventually I modified the approach a bit so that stopping a recording doesn't require us to hit the API anymore. So in theory we could use a service with multiple pods behind it if we are happy to lose some visibility (e.g. fetching job status and logs) from the MM side.

@stylianosrigas (Jun 28, 2023)

no global DB or store, same as rtcd). Specifically, we used to track job IDs in order to stop recordings or get the status of a recording job.

Unfortunately, I don't see another way here, as there is no way to guarantee the specified IP that Mattermost needs. The svc is the only thing we can rely on to always reach the application. Fetching logs should not be a big problem, as we can still see logs via our logging tools. I don't know how big a problem not getting job status in Mattermost will be, though.

streamer45 (Contributor Author)

Looking at the docs, I think this could be achieved through a StatefulSet + headless service to get a consistent DNS name pointing to the pod. That said, I don't know how much we'd like that approach.

From the MM/Calls perspective, after the changes we made, I believe there's no strong functionality requirement to access logs/job statuses other than to aid debugging. But again, in our case we have other monitoring tools in place so we could probably get away with not having those endpoints.
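For illustration, a rough sketch of the headless Service half of that idea, with assumed names and port: with ClusterIP set to None, each StatefulSet pod gets a stable per-pod DNS record of the form <pod-name>.<service-name>.<namespace>.svc.cluster.local.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessService sketches the headless Service a StatefulSet could point at
// to give each calls-offloader pod a stable DNS name. Name, label, and port
// are assumptions for illustration only.
func headlessService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "calls-offloader"},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: per-pod DNS records, no virtual IP
			Selector:  map[string]string{"app": "calls-offloader"},
			Ports: []corev1.ServicePort{{
				Name: "api",
				Port: 4545, // assumed API port
			}},
		},
	}
}
```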

streamer45 (Contributor Author)

@stylianosrigas If we went with the service, what type would you recommend? Would we need an ingress as well?

streamer45 (Contributor Author)

@stylianosrigas Right, traffic would be internal. Would a simple NodePort service work for this case, or would you recommend a LoadBalancer?

Does this affect somehow scaling of the service?

Not really, as we implemented some logic on the client so that if an API call fails due to authentication, it will automatically attempt to re-register and perform the call again.

The only detail I'd like to get some clarity on is how the recorder container image will be loaded. In the existing docker setup we fetch it the first time we receive a connection from the plugin side. Not sure if/how this is doable in k8s to be honest. Loading the image at time of need is a bit inconvenient in my mind as it could delay the start of a recording by several seconds (or more). See #26 (comment) for more context.
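For reference, a minimal sketch of the re-register-and-retry client behavior described above; the function and error names here are hypothetical, not the actual client API.

```go
package sketch

import (
	"context"
	"errors"
)

// errUnauthorized is a hypothetical sentinel for an API call failing authentication.
var errUnauthorized = errors.New("unauthorized")

// doWithReauth sketches the client behavior: if a call fails due to
// authentication, re-register and retry the call once.
func doWithReauth(ctx context.Context, register, call func(context.Context) error) error {
	err := call(ctx)
	if !errors.Is(err, errUnauthorized) {
		return err
	}
	if err := register(ctx); err != nil {
		return err
	}
	return call(ctx)
}
```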

@stylianosrigas

NodePort service should be fine. We don't need a LoadBalancer, as no traffic from outside the cluster is required here.

The only detail I'd like to get some clarity on is how the recorder container image will be loaded

You mean the job running container image? This should be loaded when the job pod starts running.
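For illustration, a minimal sketch of such a NodePort Service built with the k8s.io/api types; the app label and port are assumptions, not the actual deployment manifests.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// nodePortService sketches a NodePort Service exposing the calls-offloader API
// to in-cluster clients. The app label and port are assumptions.
func nodePortService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "calls-offloader"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app": "calls-offloader"},
			Ports: []corev1.ServicePort{{
				Name:       "api",
				Port:       4545,                 // assumed service port
				TargetPort: intstr.FromInt(4545), // assumed container port
			}},
		},
	}
}
```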

streamer45 (Contributor Author)

NodePort service should be fine

Sounds good, thanks.

You mean the job running container image? This should be loaded when the job pod starts running.

Yes, that was my understanding as well. The only problem is that it's not a particularly lean image (~500MB compressed, see here) so the concern from my side is that it could delay starting a recording by several seconds (hopefully not minutes). Of course I expect the image to get cached so it should only affect the very first attempt.

@stylianosrigas

We will set the image pull policy (https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) to IfNotPresent in the job, so the image is only pulled the very first time it runs.
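For reference, the relevant field in sketch form; the container name and image tag are assumptions. Since images are cached per node, with IfNotPresent the ~500MB recorder image is only pulled the first time a job lands on a given node.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// recorderContainer sketches the pull policy discussed above. The image is
// pulled once per node and reused from the node's cache afterwards.
// The container name and image tag are assumptions.
func recorderContainer() corev1.Container {
	return corev1.Container{
		Name:            "calls-recorder",
		Image:           "mattermost/calls-recorder:latest",
		ImagePullPolicy: corev1.PullIfNotPresent,
	}
}
```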

streamer45 (Contributor Author)

Alright, let's try it this way. If it becomes a problem I suppose we could devise some workaround, such as a dry-run job that does nothing but forces fetching the image.

```go
Spec: batchv1.JobSpec{
	// We only support one recording job at a time and don't want it to
	// restart on failure.
	Parallelism: newInt32(1),
```

@stylianosrigas

@streamer45 We should also set ttlSecondsAfterFinished to something like 48h (it is specified in seconds) so that finished jobs don't need manual deletion. This will help us avoid ending up with hundreds of completed jobs.

streamer45 (Contributor Author)

Thanks. The way it works now is that we automatically delete successful jobs but retain failed ones, the reason being that deleting a failed job could mean data loss (e.g. a recording that was not uploaded). This way we give administrators a chance to recover files if necessary.

That said, we could set TTLSecondsAfterFinished as you suggest, but if we do I'd probably make it configurable, as it becomes a data retention setting in practice.
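For reference, a sketch of what that would look like on the JobSpec, assuming the 48h value suggested above; in the actual service this would presumably come from a config setting rather than being hard-coded.

```go
package sketch

import batchv1 "k8s.io/api/batch/v1"

// withTTL sketches setting ttlSecondsAfterFinished so finished jobs are
// garbage-collected automatically. The 48h value mirrors the suggestion above.
func withTTL(spec batchv1.JobSpec) batchv1.JobSpec {
	ttl := int32(48 * 60 * 60) // 48h, expressed in seconds
	spec.TTLSecondsAfterFinished = &ttl
	return spec
}
```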

streamer45 (Contributor Author)

If you don't mind, given this will only affect failed jobs, I am considering deferring this to https://mattermost.atlassian.net/browse/MM-49202, mostly to avoid exposing an implementation-specific setting that only works for k8s.

@stylianosrigas left a comment

Overall looks great, two main points from my side ;)

@cpoile (Member) left a comment

Thank you for the careful work @streamer45! 🎉

@streamer45 added the 3: Reviews Complete label and removed the 2: Dev Review and 3: Reviews Complete labels on Jul 6, 2023
@streamer45 (Contributor Author)

@stylianosrigas I think we should be able to use mattermost/calls-offloader-daily:dev-526c668 for testing, mostly to keep the PR open for the security review. Let me know.

@stylianosrigas

@stylianosrigas I think we should be able to use mattermost/calls-offloader-daily:dev-526c668 for testing, mostly to keep the PR open for the security review. Let me know.

@streamer45 Is this based on the latest PR code?

@streamer45 added the 3: Reviews Complete label and removed the Do Not Merge/Awaiting PR label on Jul 24, 2023
@streamer45 merged commit 6daa13a into master on Jul 28, 2023
3 checks passed
@streamer45 deleted the MM-51801-cloud-native branch on July 28, 2023 at 15:22