kustomize-controller gets OOMKilled every hour #1111

Open · bharathvrajan opened this issue Mar 15, 2024 · 3 comments

bharathvrajan commented Mar 15, 2024

Background:

The kustomize-controller pod is getting OOMKilled every hour or so. It reaches around 7.65Gi and gets OOM-killed, as the memory limit is 8Gi.

  • Image - ghcr.artifactory.gcp.anz/fluxcd/kustomize-controller:v1.2.2
  • There are 184 kustomizations in total
  • Concurrency is set to 20.

These are the flags enabled:

      containers:
      - args:
        - --events-addr=http://notification-controller.flux-system.svc.cluster.local./
        - --watch-all-namespaces=true
        - --log-level=info
        - --log-encoding=json
        - --enable-leader-election
        - --concurrent=20
        - --kube-api-qps=500
        - --kube-api-burst=1000
        - --requeue-dependency=15s
        - --no-remote-bases=true
        - --feature-gates=DisableStatusPollerCache=true

Requests & Limits:

        resources:
          limits:
            memory: 8Gi
          requests:
            cpu: "1"
            memory: 8Gi
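
For anyone trying to reproduce this, the memory climb toward the limit can be watched with kubectl top (a rough sketch; it assumes metrics-server is available and the default flux-system labels):

      # Watch the controller's memory usage grow toward the 8Gi limit.
      # Namespace and label assume a default Flux install; adjust if customised.
      watch -n 30 kubectl top pod -n flux-system -l app=kustomize-controller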

What's been tried so far:

  1. Added the flag --feature-gates=DisableStatusPollerCache=true to the kustomize-controller deployment, as mentioned in this issue, but it didn't make a difference; the pod still gets OOM-killed within an hour.

  2. Reduced the concurrency to 5. At this setting the pod is stable and memory consumption sits around ~2.5G.

  3. Took a heap profile; inuse_space is only around ~22.64MB, which is very small compared to the pod's actual memory usage. Couldn't find anything useful there, but here's the link to the flamegraph. Also, here's the heap dump - heap.out.zip (a capture sketch follows below).

  4. Checked whether we have a large repository that loads unnecessary files, as mentioned in this issue

    This is from the source-controller:

    ~ $ du -sh /data/*
    6.1M	     /data/gitrepository
    824.0K     /data/helmchart
    5.8M	    /data/helmrepository
    16.0K	    /data/lost+found
    48.0K	    /data/ocirepository 
    

I'd like to understand what is causing the memory spike and the OOM kills.
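
For reference, a heap profile like the one in step 3 can be captured roughly like this (a sketch; the pprof port below is an assumption, so check the Flux debugging docs for the exact endpoint):

      # Forward the controller's pprof endpoint; port 9440 (the healthz server) is an assumption.
      kubectl -n flux-system port-forward deploy/kustomize-controller 9440:9440 &

      # Grab the heap profile and summarise in-use memory with the Go toolchain.
      curl -s http://localhost:9440/debug/pprof/heap -o heap.out
      go tool pprof -top -inuse_space heap.out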

stefanprodan (Member) commented Mar 15, 2024

Are you using a RAM disk for the /tmp volume, as shown here: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-in-memory-kustomize-builds?

Can you look at /tmp in the kustomize-controller pod and see how large it is?
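
Something like this should answer both questions, assuming the controller image ships a shell with df and du:

      # Is /tmp tmpfs (RAM-backed) or a regular volume, and how much is in it?
      kubectl -n flux-system exec deploy/kustomize-controller -- df -h /tmp
      kubectl -n flux-system exec deploy/kustomize-controller -- du -sh /tmp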

bharathvrajan (Author) commented Mar 17, 2024

Are you using a RAM disk

We used in-memory kustomize builds, but that was a problem: they kept exceeding the memory limits of the nodes. We also tried ephemeral SSDs, but they got corrupted when the kustomize-controller restarted. So currently /tmp is backed by a disk.

The size of /tmp is 12.7G:

$ du -sh tmp
12.7G	tmp

stefanprodan (Member) commented

OK, so it looks like all these problems are due to filesystem operations. /tmp should be empty almost all the time. Is there anything inside the repo that could cause this, such as recursive symlinks? Looking at the memory profile, the issue seems related to Go untar and file read operations, which are all from the Go stdlib.
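
To check, something like this over a local clone of the repository would list every symlink and its target (the path is a placeholder):

      # List all symlinks in the repo; recursive or broken links would show up here.
      find /path/to/repo-clone -type l -exec ls -l {} +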
