kustomize-controller gets OOMKilled every hour #1111

Open · bharathvrajan opened this issue Mar 15, 2024 · 3 comments

bharathvrajan commented Mar 15, 2024

Background:

The kustomize-controller pod is getting OOMKilled every hour or so. It reaches around 7.65Gi and gets OOM-killed, as the memory limit is 8Gi.

  • Image - ghcr.artifactory.gcp.anz/fluxcd/kustomize-controller:v1.2.2
  • There are 184 kustomizations in total
  • Concurrency is set to 20.

These are the flags enabled:

      containers:
      - args:
        - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        - --concurrent=20
        - --kube-api-qps=500
        - --kube-api-burst=1000
        - --requeue-dependency=15s
        - --no-remote-bases=true
        - --feature-gates=DisableStatusPollerCache=true

Requests & Limits:

        resources:
          limits:
            memory: 8Gi
          requests:
            cpu: "1"
            memory: 8Gi
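
For anyone trying to reproduce this, the memory climb toward the limit can be watched with kubectl top (a rough sketch; it assumes metrics-server is available and the default flux-system labels):

      # Watch the controller's memory usage grow toward the 8Gi limit.
      # Namespace and label assume a default Flux install; adjust if customised.
      watch -n 30 kubectl top pod -n flux-system -l app=kustomize-controller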

What's been tried so far:

  1. Added the flag --feature-gates=DisableStatusPollerCache=true to the kustomize-controller deployment, as mentioned in this issue, but it didn't make a difference; the pod still gets OOM-killed within an hour.

  2. Reduced the concurrency to 5. At this setting the pod is stable and memory consumption sits around ~2.5G.

  3. Took a heap profile; inuse_space is only around ~22.64MB, which is very small compared to the pod's actual memory usage. Couldn't find anything useful there, but here's the link to the flamegraph. Also, here's the heap dump - heap.out.zip (a capture sketch follows below).

  4. Checked whether we have a large repository that loads unnecessary files, as mentioned in this issue

    This is from the source-controller:

    ~ $ du -sh /data/*
    6.1M	     /data/gitrepository
    824.0K     /data/helmchart
    5.8M	    /data/helmrepository
    16.0K	    /data/lost+found
    48.0K	    /data/ocirepository 
    

I'd like to understand what is causing the memory spike and the OOM kills.
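
For reference, a heap profile like the one in step 3 can be captured roughly like this (a sketch; the pprof port below is an assumption, so check the Flux debugging docs for the exact endpoint):

      # Forward the controller's pprof endpoint; port 9440 (the healthz server) is an assumption.
      kubectl -n flux-system port-forward deploy/kustomize-controller 9440:9440 &

      # Grab the heap profile and summarise in-use memory with the Go toolchain.
      curl -s http://localhost:9440/debug/pprof/heap -o heap.out
      go tool pprof -top -inuse_space heap.out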

stefanprodan (Member) commented Mar 15, 2024

Are you using a RAM disk for the /tmp volume, as shown here: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-in-memory-kustomize-builds?

Can you look at /tmp in the kustomize-controller pod and see how large it is?
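
Something like this should answer both questions, assuming the controller image ships a shell with df and du:

      # Is /tmp tmpfs (RAM-backed) or a regular volume, and how much is in it?
      kubectl -n flux-system exec deploy/kustomize-controller -- df -h /tmp
      kubectl -n flux-system exec deploy/kustomize-controller -- du -sh /tmp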

bharathvrajan (Author) commented Mar 17, 2024

Are you using a RAM disk

We used in-memory kustomize builds, but that was a problem: they kept exceeding the memory limits of the nodes. We also tried ephemeral SSDs, but they got corrupted when the kustomize-controller restarted. So currently /tmp is backed by a disk.

The size of /tmp is 12.7G:

$ du -sh tmp
12.7G	tmp

stefanprodan (Member) commented

OK, so it looks like all these problems are due to filesystem operations. /tmp should be empty almost all the time. Is there anything inside the repo that could cause this, such as recursive symlinks? Looking at the memory profile, the issue seems related to Go untar and file read operations, which are all from the Go stdlib.
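
To check, something like this over a local clone of the repository would list every symlink and its target (the path is a placeholder):

      # List all symlinks in the repo; recursive or broken links would show up here.
      find /path/to/repo-clone -type l -exec ls -l {} +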
