Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Celery lms/cms-worker consumes too much RAM #1126

Open
igobranco opened this issue Sep 27, 2024 · 4 comments
Open

Celery lms/cms-worker consumes too much RAM #1126

igobranco opened this issue Sep 27, 2024 · 4 comments
Labels
feature request New features will be processed by decreasing priority

Comments

@igobranco
Copy link

Bug description

The normal execution of the celery workers, lms-worker and cms-worker services, consumes too much memory by default.
Tested on my tutor local, but I'm assuming that with tutor dev is the same.
By default the celery start 1 process by each CPU. If the deployed laptop or server has multiple CPU architecture, then the celery will launch multiple OS processes.
Each celery OS process consumes memory.
This makes more CPUs you have, more memory RAM the celery worker will consume.
The problem that I see is that on a deployed server, the operator probably don't want this magic, and want to control how much celery OS process each container/pod will start and consequently how much memory each container/pods will consume.

How to reproduce

Run a tutor local environment and see how much RAM your celery workers is consuming.
On 16 CPU server/laptop an idle environment each container lms/cms-worker consumes >2GB RAM per container.

If you add --concurrency=1 to the lms/cms-worker command an idle tutor local env uses <300MB per container.

Environment

tutor, version 14.2.3
But this applies also to newer version.

Solution (my opinion)

  1. Add a way to parameterize the concurrency on each worker, ex: --concurrency={{ LMS_WORKER_CELERY_CONCURRENCY }}
  2. Each celery worker by default should always consume the minimum memory, so it should defaults to just 1.
  3. In my opinion there should be an easy way to customize both celery commands, lms-worker and cms-worker. Ex. adding the --prefetch-multiplier=1 --without-gossip --without-mingle
@regisb
Copy link
Contributor

regisb commented Sep 30, 2024

Related PR: #1010

For more information regarding additional options, see: edx/configuration#68

@regisb
Copy link
Contributor

regisb commented Sep 30, 2024

Hi Ivo, I agree that we need to provide a way to override the number of concurrent workers. And we also need to provide good defaults for that value.

In that context, I'm curious to have your take on this comment by @dkaliberda, where he makes the case that we should default to concurrency=1 and instead scale the number of replicas.

I'm not sure I agree with that comment, because in my experience there is quite a bit of overhead incurred by scaling celery workers horizontally (by increasing the number of replicas) as opposed to vertically (by increasing the number of workers). For instance, here are the figures for memory usage on my laptop, in idle mode:

  • 1 replica x 1 worker: 319 MB
  • 1 replica x 4 workers: 746 MB
  • 4 replicas x 1 worker: 1280 MB

In your current use case, what would be the ideal number of replicas/workers? (both for LMS and CMS)

EDIT: I just learned about process autoscaling in Celery and I'm very tempted to use that as the default. Of course, we would still need to provide a mechanism to override that.

@regisb regisb added the feature request New features will be processed by decreasing priority label Oct 1, 2024
@igobranco
Copy link
Author

Hi @regisb, I think a default concurrency=1 is a good value. Nevertheless, I prefer that it should be parameterized.
Currently, for our Kubernetes PROD deployment I have min 6 replicas, each with concurrency=1, but an horizontal auto scaler for 20. I have dedicated hardware so having 6 replicas always up is not a problem for us. Only sporadic is that it upscale up.

About the autoscaling in Celery, I also just found out it! I think I won't change my current setup, because it's just working. But if had found out before, I think I would be tempted to just use it. Even for our case just fixed 2 replicas with a vertical autoscale on Celery would be good option. The good news is that it would benefit everyone, docker compose or K8s installations.
--concurrency {{ OPENEDX_CMS_CELERY_WORKERS }} --autoscale={{ OPENEDX_CMS_CELERY_MAX_WORKERS }},{{ OPENEDX_CMS_CELERY_MIN_WORKERS }}

3 configurations with:

  • OPENEDX_CMS_CELERY_WORKERS default 1
  • OPENEDX_CMS_CELERY_MAX_WORKERS default 1
  • OPENEDX_CMS_CELERY_MIN_WORKERS default 1

An upgrade note on the docs could be added to configure a proper value of OPENEDX_CMS_CELERY_MAX_WORKERS, like number of CPUs

For example this is a snippet of my custom tutor plugin to override the workers:

apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cms-worker
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 900
          containers:
            - name: cms-worker
              args:
                - celery
                - --app=cms.celery
                - worker
                - --loglevel=info
                - --concurrency=1
                - --hostname=edx.cms.core.default.%%h
                - --max-tasks-per-child=100
                - --prefetch-multiplier=1
                - --exclude-queues=edx.lms.core.default
                - --without-gossip
                - --without-mingle

The terminationGracePeriodSeconds is to prevent the Instructors export CSVs to be terminated during the export.
Recently, I had to add --without-gossip and --without-mingle to make the celery workers recover from a Redis downtime, celery/celery#7276

@regisb
Copy link
Contributor

regisb commented Oct 1, 2024

While I do think that administrators should be able to customise the default celery concurrency, I disagree that --concurrency=1 is the better default -- because it's much more expensive. According to my testing (see my comment here) running multiple replicas with a single process has a bigger overhead in terms of memory usage. (although I'm not sure exactly why).

I think that --autoscale=4,1 would be a good default for most people. We should be able to override this default setting using config patches or filters. Thus, I'm not inclined to merge #1010. I'll propose a different implementation.

I haven't looked yet at the other options that you are suggesting -- I'll investigate once I start working on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New features will be processed by decreasing priority
Projects
Development

No branches or pull requests

2 participants