
[MM-51801] Cloud native support #34

Merged: 14 commits merged into master from MM-51801-cloud-native on Jul 28, 2023
Conversation

streamer45 (Contributor)

Summary

This PR adds cloud native support to the calls-offloader service. It mostly boils down to providing a Kubernetes API implementation based on Jobs as a more cloud-oriented alternative to the existing one based on simple Docker containers.

Some previous context can also be found in #26.

Related PRs

Calls plugin: mattermost/mattermost-plugin-calls#409
Calls recorder: mattermost/calls-recorder#33

Design doc

https://docs.google.com/document/d/1dCik_WdamxBg_UaPYL_k04EAPFP1-rkINk-GTuIBjYc/edit

Ticket Link

https://mattermost.atlassian.net/browse/MM-51801
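For readers less familiar with the Jobs-based approach described in the summary, here is a minimal, hypothetical sketch of what creating a recording job through client-go can look like. It is not the PR's actual implementation: the namespace, image tag, container name, and field values are illustrative assumptions.

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newInt32(n int32) *int32 { return &n }

// createRecordingJob is an illustrative sketch (not the actual implementation)
// of spawning a recording job through the Kubernetes Jobs API.
func createRecordingJob(ctx context.Context, cs kubernetes.Interface) (*batchv1.Job, error) {
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			// Human-friendly prefix; the API server appends a unique suffix.
			GenerateName: "calls-recorder-job-",
		},
		Spec: batchv1.JobSpec{
			// One recording job at a time, no retries on failure.
			Parallelism:  newInt32(1),
			Completions:  newInt32(1),
			BackoffLimit: newInt32(0),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:            "calls-recorder",
						Image:           "mattermost/calls-recorder:latest", // assumed tag
						ImagePullPolicy: corev1.PullIfNotPresent,
					}},
				},
			},
		},
	}
	return cs.BatchV1().Jobs("default").Create(ctx, job, metav1.CreateOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if _, err := createRecordingJob(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
		panic(err)
	}
}
```

Using GenerateName gives each job a readable, unique name, which matches the "Use human friendly prefix for job names" commit in the list below.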

* JobService interface

* Bump build deps

* [MM-52323] Refactor docker implementation (#24)

* Refactor docker job service

* Add tests
* Setup k8s client

* Implement k8s API

* Remove StopJob

* Implement Init() API call

* Setup k8s CI

* Update sample config

* Add local k8s development doc

* Use human friendly prefix for job names

* Add support for passing custom tolerations
@streamer45 added the 2: Dev Review, 3: Security Review, and Do Not Merge/Awaiting PR labels on Jun 26, 2023
@streamer45 added this to the v0.3.0 milestone on Jun 26, 2023
@streamer45 self-assigned this on Jun 26, 2023
```
kubectl logs -l app=calls
```

### Check pod IP address

@stylianosrigas

@streamer45 Any objection to adding a k8s service in front of the calls-offloader app and using that instead of the IP? IPs are subject to change, but a svc would serve properly as an abstraction.

streamer45 (Contributor Author)

@stylianosrigas We discussed this a bit in the design doc when considering load balancing options (see comment for context).

The way this service was originally designed wouldn't allow for a k8s service if running more than one pod because this is not really a stateless application and each calls-offloader instance is independent (no global DB or store, same as rtcd). Specifically, we used to track job IDs in order to stop recordings or get the status of a recording job.

Eventually I modified the approach a bit so that stopping a recording doesn't require us to hit the API anymore. So in theory we could use a service with multiple pods behind it if we are happy to lose some visibility (e.g. fetching job status and logs) from the MM side.

@stylianosrigas (Jun 28, 2023)

no global DB or store, same as rtcd). Specifically, we used to track job IDs in order to stop recordings or get the status of a recording job.

Unfortunately, I don't see another way here, as there is no way to guarantee the specified IP that Mattermost needs. The svc is the only thing we can rely on to always reach the application. Fetching logs should not be a big problem, as we can still see logs via our logging tools. I don't know how big a problem not getting job status in Mattermost will be, though.

streamer45 (Contributor Author)

Looking at the docs, I think this could be achieved through a StatefulSet + headless service to get a consistent DNS name pointing to the pod. That said, I don't know how much we'd like that approach.

From the MM/Calls perspective, after the changes we made, I believe there's no strong functionality requirement to access logs/job statuses other than to aid debugging. But again, in our case we have other monitoring tools in place so we could probably get away with not having those endpoints.
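For illustration, a rough sketch of the headless Service half of that idea, with assumed names and port: with ClusterIP set to None, each StatefulSet pod gets a stable per-pod DNS record of the form <pod-name>.<service-name>.<namespace>.svc.cluster.local.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessService sketches the headless Service a StatefulSet could point at
// to give each calls-offloader pod a stable DNS name. Name, label, and port
// are assumptions for illustration only.
func headlessService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "calls-offloader"},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: per-pod DNS records, no virtual IP
			Selector:  map[string]string{"app": "calls-offloader"},
			Ports: []corev1.ServicePort{{
				Name: "api",
				Port: 4545, // assumed API port
			}},
		},
	}
}
```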

streamer45 (Contributor Author)

@stylianosrigas If we went with the service, what type would you recommend? Would we need an ingress as well?

streamer45 (Contributor Author)

@stylianosrigas Right, traffic would be internal. Would a simple NodePort service work for this case, or would you recommend a LoadBalancer?

Does this affect somehow scaling of the service?

Not really, as we implemented some logic on the client so that if an API call fails due to authentication, it will automatically attempt to re-register and perform the call again.

The only detail I'd like to get some clarity on is how the recorder container image will be loaded. In the existing docker setup we fetch it the first time we receive a connection from the plugin side. Not sure if/how this is doable in k8s to be honest. Loading the image at time of need is a bit inconvenient in my mind as it could delay the start of a recording by several seconds (or more). See #26 (comment) for more context.
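For reference, a minimal sketch of the re-register-and-retry client behavior described above; the function and error names here are hypothetical, not the actual client API.

```go
package sketch

import (
	"context"
	"errors"
)

// errUnauthorized is a hypothetical sentinel for an API call failing authentication.
var errUnauthorized = errors.New("unauthorized")

// doWithReauth sketches the client behavior: if a call fails due to
// authentication, re-register and retry the call once.
func doWithReauth(ctx context.Context, register, call func(context.Context) error) error {
	err := call(ctx)
	if !errors.Is(err, errUnauthorized) {
		return err
	}
	if err := register(ctx); err != nil {
		return err
	}
	return call(ctx)
}
```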

@stylianosrigas

NodePort service should be fine. We don't need a LoadBalancer, as no traffic from outside the cluster is required here.

The only detail I'd like to get some clarity on is how the recorder container image will be loaded

You mean the job running container image? This should be loaded when the job pod starts running.
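For illustration, a minimal sketch of such a NodePort Service built with the k8s.io/api types; the app label and port are assumptions, not the actual deployment manifests.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// nodePortService sketches a NodePort Service exposing the calls-offloader API
// to in-cluster clients. The app label and port are assumptions.
func nodePortService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "calls-offloader"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app": "calls-offloader"},
			Ports: []corev1.ServicePort{{
				Name:       "api",
				Port:       4545,                 // assumed service port
				TargetPort: intstr.FromInt(4545), // assumed container port
			}},
		},
	}
}
```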

streamer45 (Contributor Author)

NodePort service should be fine

Sounds good, thanks.

You mean the job running container image? This should be loaded when the job pod starts running.

Yes, that was my understanding as well. The only problem is that it's not a particularly lean image (~500MB compressed, see here) so the concern from my side is that it could delay starting a recording by several seconds (hopefully not minutes). Of course I expect the image to get cached so it should only affect the very first attempt.

@stylianosrigas

We will set the image pull policy (https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy) to IfNotPresent in the job, so the image is only pulled the very first time it runs.
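For reference, the relevant field in sketch form; the container name and image tag are assumptions. Since images are cached per node, with IfNotPresent the ~500MB recorder image is only pulled the first time a job lands on a given node.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// recorderContainer sketches the pull policy discussed above. The image is
// pulled once per node and reused from the node's cache afterwards.
// The container name and image tag are assumptions.
func recorderContainer() corev1.Container {
	return corev1.Container{
		Name:            "calls-recorder",
		Image:           "mattermost/calls-recorder:latest",
		ImagePullPolicy: corev1.PullIfNotPresent,
	}
}
```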

streamer45 (Contributor Author)

Alright, let's try it this way. If it becomes a problem I suppose we could devise some workaround, such as a dry-run job that does nothing but forces fetching the image.

```go
Spec: batchv1.JobSpec{
	// We only support one recording job at a time and don't want it to
	// restart on failure.
	Parallelism: newInt32(1),
```

@stylianosrigas

@streamer45 We should also set ttlSecondsAfterFinished to something like 48h (it is specified in seconds) so that finished jobs don't need manual deletion. This will help us avoid ending up with hundreds of completed jobs.

streamer45 (Contributor Author)

Thanks. The way it works now is that we automatically delete successful jobs but retain failed ones, the reason being that deleting a failed job could mean data loss (e.g. a recording that was not uploaded). This way we give administrators a chance to recover files if necessary.

That said, we could set TTLSecondsAfterFinished as you suggest, but if we do I'd probably make it configurable, as it becomes a data retention setting in practice.
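For reference, a sketch of what that would look like on the JobSpec, assuming the 48h value suggested above; in the actual service this would presumably come from a config setting rather than being hard-coded.

```go
package sketch

import batchv1 "k8s.io/api/batch/v1"

// withTTL sketches setting ttlSecondsAfterFinished so finished jobs are
// garbage-collected automatically. The 48h value mirrors the suggestion above.
func withTTL(spec batchv1.JobSpec) batchv1.JobSpec {
	ttl := int32(48 * 60 * 60) // 48h, expressed in seconds
	spec.TTLSecondsAfterFinished = &ttl
	return spec
}
```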

streamer45 (Contributor Author)

If you don't mind, given this will only affect failed jobs, I am considering deferring this to https://mattermost.atlassian.net/browse/MM-49202, mostly to avoid exposing an implementation-specific setting that only works for k8s.

@stylianosrigas left a comment

Overall looks great, two main points from my side ;)

@cpoile (Member) left a comment

Thank you for the careful work @streamer45! 🎉

@streamer45 added the 3: Reviews Complete label and removed the 2: Dev Review and 3: Reviews Complete labels on Jul 6, 2023
@streamer45 (Contributor Author)

@stylianosrigas I think we should be able to use mattermost/calls-offloader-daily:dev-526c668 for testing, mostly to keep the PR open for the security review. Let me know.

@stylianosrigas

@stylianosrigas I think we should be able to use mattermost/calls-offloader-daily:dev-526c668 for testing, mostly to keep the PR open for the security review. Let me know.

@streamer45 Is this based on the latest PR code?

@streamer45 added the 3: Reviews Complete label and removed the Do Not Merge/Awaiting PR label on Jul 24, 2023
@streamer45 merged commit 6daa13a into master on Jul 28, 2023
3 checks passed
@streamer45 deleted the MM-51801-cloud-native branch on July 28, 2023 at 15:22