OOMKill for period of time after which it magically resolves without changes #1110
Comments
Does the time it takes to end up in an OOM kill allow you to collect a HEAP profile? Instructions for this can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile |
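For anyone who wants to script the capture so it can run repeatedly right up to the kill, here is a minimal Python sketch. It assumes the controller has already been port-forwarded to `localhost:8080` and that the profile is served at the standard Go pprof path `/debug/pprof/heap`; both are assumptions to check against the linked instructions. The helper names `heap_profile_url` and `save_heap_profile` are hypothetical.

```python
import urllib.request


def heap_profile_url(host="localhost", port=8080):
    """Build the heap endpoint URL (host, port and path are assumptions)."""
    return f"http://{host}:{port}/debug/pprof/heap"


def save_heap_profile(dest, url=None, opener=urllib.request.urlopen):
    """Download one heap profile to `dest` and return the number of bytes written.

    `opener` is injectable so the function can be exercised without a cluster.
    """
    url = url or heap_profile_url()
    with opener(url) as resp:
        data = resp.read()
    with open(dest, "wb") as fh:
        fh.write(data)
    return len(data)


if __name__ == "__main__":
    # Requires an active port-forward to the controller pod.
    size = save_heap_profile("heap.out")
    print(f"wrote {size} bytes to heap.out")
```

Running this in a loop with timestamped file names would leave the last successful snapshot on disk even if the pod is killed between captures.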
I tried to catch it right before the kill. |
Same thing is happening with some of our clusters as well |
@cwrau did this start with RC.3 as well? In addition to this, did you run RC.2 or RC.1 before without issues? Based on the HEAP profile shared, I can't tell what is happening as it's taken before the actual issue seems to occur. What may help is temporarily increasing the limits to be able to take a proper snapshot while the thing happens without the Pod getting killed. |
No, this seems to be kinda version-universal, we have old clusters with source-controller 0.18.0 which also got OOM'd |
Is there any chance you are all making use of BitBucket and things have now returned to normal? |
No BitBucket for us that I can see; HelmRepositories;
GitRepositories;
(self hosted gitlab) But yes, it seems to have stopped |
Any chance your crashing instances are hosted on AWS EKS (AMD64, EC2, managed node groups)? I am trying to find a correlation here between your setups, as the issue seems to have started for a group of users at the same time, with a range of controller versions, and then magically stopped. |
Hello @hiddeco, I'm with the same organization as @cwrau. No, we don't use AWS. We use our own infrastructure based on OpenStack. We are also trying to find correlations, especially since this came and went out of nowhere. Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue. |
Without this happening again, and a proper HEAP snapshot when this happens, I fear this will be very much like looking for a needle in a haystack. |
The clusters I know of just don't have a memory limit 😉 |
Oh, you're right. They all don't. Oops! |
Same thing happened on at least one of our clusters; https://snapshots.raintank.io/dashboard/snapshot/6QEyFh33cs2tH6FYj8enOeY0rlbY7jvc |
Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects? |
Yeah, I was thinking the same thing but in my snapshot you can see that the responses take longer and the request and response sizes also got bigger, maybe that's got something to do with this? |
It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery. Any chance you collect logs somewhere that survive pod restarts? |
Of course; https://snapshots.raintank.io/dashboard/snapshot/o6vLv6rjgym3qTdZs2AH6c0m3CjQrIyl Better dashboard; https://snapshots.raintank.io/dashboard/snapshot/WinIZkonEwYwQaYoh8fQ1NtW2NK2yWxp |
Based on the following log lines (and alike), it seems that the whole cluster network in general was unstable at the time.
{"level":"error","ts":"2023-05-31T11:57:14.737Z","logger":"runtime","msg":"Failed to release lock: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:14.706Z","logger":"runtime","msg":"failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition\n"}
{"level":"info","ts":"2023-05-31T11:57:14.703Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta2.OCIRepository: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"error","ts":"2023-05-31T11:57:12.788Z","logger":"runtime","msg":"k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1beta2.Bucket: failed to list *v1beta2.Bucket: Unauthorized\n"}
{"level":"error","ts":"2023-05-31T11:57:10.671Z","logger":"runtime","msg":"Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"grafana.17643873d2342f01\", GenerateName:\"\", Namespace:\"flux-system\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"HelmRepository\", Namespace:\"flux-system\", Name:\"grafana\", UID:\"8d83e45a-9757-4b0d-8452-53a4d9404aa1\", APIVersion:\"source.toolkit.fluxcd.io/v1beta2\", ResourceVersion:\"62460920\", FieldPath:\"\"}, Reason:\"GarbageCollectionSucceeded\", Message:\"garbage collected artifacts for deleted resource\", Source:v1.EventSource{Component:\"source-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), LastTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'events \"grafana.17643873d2342f01\" is forbidden: unable to create new content in namespace flux-system because it is being terminated' (will not retry!)\n"}
This by itself does not explain the OOMKill, but may lead to pointers elsewhere (CoreDNS, Kubernetes API server?) which explain how we can replicate this. |
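To check whether other affected clusters show the same signals, a rough triage of the controller's JSON logs could look like the sketch below. The three substrings it counts are taken from the log lines quoted in this thread; the bucket names and the `triage` helper are made up for illustration, and matching on substrings is only a heuristic.

```python
import json
from collections import Counter

# Substrings taken from the log lines quoted in this thread.
# The bucket names on the left are hypothetical labels.
SIGNALS = {
    "unauthorized": "Unauthorized",
    "lease_lost": "failed to renew lease",
    "namespace_terminating": "is being terminated",
}


def triage(lines):
    """Count how many JSON log entries contain each signal in their `msg` field."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except (json.JSONDecodeError, TypeError):
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        for name, needle in SIGNALS.items():
            if needle in msg:
                counts[name] += 1
    return counts


SAMPLE = [
    r'{"level":"error","ts":"2023-05-31T11:57:14.737Z","logger":"runtime","msg":"Failed to release lock: Unauthorized\n"}',
    r'{"level":"info","ts":"2023-05-31T11:57:14.706Z","logger":"runtime","msg":"failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition\n"}',
    r'{"level":"info","ts":"2023-05-31T11:57:14.703Z","msg":"Wait completed, proceeding to shutdown the manager"}',
]

if __name__ == "__main__":
    for name, count in triage(SAMPLE).items():
        print(name, count)
```

Feeding it the logs that survive pod restarts, split per cluster, would make it easier to compare whether API server or lease problems line up with the memory spikes.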
Hm, it just occurred to me that that cluster might not be the best example, so let's instead take a look at this one; CoreDNS; https://snapshots.raintank.io/dashboard/snapshot/v7J2X7eC7xHsSXGRdRjJ3pXiTe6kja4r Sorry 😅 |
We see the same on our 15+ clusters in AWS EKS (Karpenter nodes): OOM kills with exit code 137. We are running the latest version before the RC versions. What would help to find the problem? A heap dump? |
Here are two heap maps of the same pod, there are about 3h30min between these two: The memory consumption increased from about 290MB to 2240MB in this time |
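As a back-of-the-envelope check on those two measurements, the sketch below computes the average growth rate and extrapolates when a given memory limit would be hit. The 4096 MB limit is a hypothetical value chosen only for illustration, and the linear extrapolation is an assumption; the thread does not establish that the growth is linear.

```python
def growth_rate_mb_per_hour(start_mb, end_mb, hours):
    """Average memory growth between two measurements, in MB per hour."""
    return (end_mb - start_mb) / hours


def hours_until_limit(limit_mb, current_mb, rate_mb_per_hour):
    """Hours until `limit_mb` is reached at a constant rate, or None if not growing."""
    if rate_mb_per_hour <= 0:
        return None
    return max(limit_mb - current_mb, 0) / rate_mb_per_hour


if __name__ == "__main__":
    # 290 MB -> 2240 MB over 3h30min, as reported above.
    rate = growth_rate_mb_per_hour(290, 2240, 3.5)
    print(f"average growth: {rate:.0f} MB/h")

    # Hypothetical 4096 MB limit, assuming growth stays linear.
    remaining = hours_until_limit(4096, 2240, rate)
    print(f"hours until a 4096 MB limit: {remaining:.1f}")
```

That works out to roughly 557 MB per hour on average, which gives a sense of how long a temporarily raised limit would buy for taking a snapshot.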
Here is a heap map of the fiftieth incarnation of the pod in a crash loop where it consumes about 12.4 GB of memory |
Please share the output from
Generally speaking, it is recommended to update to the newest versions even if they are release candidates, as we also documented in our release notes. In addition, I can see that a lot of your memory usage, @akesser, is coming from the parsing of Helm repository indexes. It may be worth looking at https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching to see if this prevents the issue from happening. |
We updated to the newest version of flux. With the old version, the error occurred every 24 hours, now it occurs every 5 to 10 minutes. And we use helm repository caching:
|
And here is a heap dump: |
source-controller just got OOMKilled out of the blue. It started about an hour ago and I can't see any prior activity that triggered it. No recently added sources, nothing. It started with rc3. I upgraded it to rc4 but it's the same behaviour. The memory usage goes through the roof and the cluster kills the pod.
flux check
gets stuck at this point as the source controller is not responding.