Possible memory leak from the profiler #1025
If useful, I could try to revert to v1.32.1 + #976.
@nik-andreev sorry for the late reply, I was on vacation. I'm also very sorry you're hitting this problem! This is indeed an unexpected amount of memory usage and I consider it high priority to figure out the root cause for it.
The reason I'm asking is that, the way Go's built-in heap profiler is designed, it keeps using more memory until the stack traces of all allocating code paths have been hit. Usually that's a small number of possible stack traces, but recursion can greatly expand it. Most people are unaware of this because the memory used by the memory profiler is not reported in the heap profile; it can, however, be observed via MemStats.BuckHashSys. v1.33.0 of dd-trace-go added a new feature which ends up duplicating some of the memory already allocated by Go internally, which also makes it visible in the heap profile. The way it's implemented, it should not be a memory leak (beyond the leakiness that's already inherent to the built-in heap profiler), but seeing > 500 MB allocated in your last screenshots makes me wonder if there is a bug. If you want, feel free to ping me on our public slack for direct support on this.
No worries @felixge, thanks for looking! We ended up reverting to v1.32.1 + #976 and that solved the immediate problem. I won't be able to go back to v1.33 for at least a few weeks, so the data on 4 might be delayed. To answer your questions:
@nik-andreev thanks for the fast response. I just thought of another thing worth instrumenting: can you pass a statsd client to the profiler like this? If yes, it'd be very interesting to see the datadog.profiler.go.uploaded_profile_bytes metric this will enable. If the problem is indeed caused by the large code base, this should show profile uploads as large as 250 MB when the problem hits.
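The snippet referenced by "like this" did not survive here; a minimal sketch of what wiring a statsd client into the profiler might look like (the service name and agent address are placeholders, and the datadog-go import path depends on the version in use):

```go
package main

import (
	"log"

	"github.com/DataDog/datadog-go/statsd"
	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Plain dogstatsd client pointed at the local Datadog agent.
	statsdClient, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer statsdClient.Close()

	// With a statsd client attached, the profiler can emit its internal
	// telemetry, including datadog.profiler.go.uploaded_profile_bytes.
	if err := profiler.Start(
		profiler.WithService("my-service"), // placeholder
		profiler.WithStatsd(statsdClient),
	); err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()

	// ... application code ...
}
```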
@felixge, I finally was able to collect some data. The app now emits the size of MemStats.BuckHashSys. I haven't yet changed the profiler to use a statsd client.
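For anyone wanting to do the same, emitting that value only takes a periodic runtime.ReadMemStats call; a minimal sketch (the log-based reporting and one-minute interval are illustrative stand-ins for a real metrics sink):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// reportBuckHashSys periodically logs MemStats.BuckHashSys, the memory the
// runtime uses for its profiling bucket hash table.
func reportBuckHashSys() {
	ticker := time.NewTicker(time.Minute) // illustrative interval
	defer ticker.Stop()
	for range ticker.C {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("runtime.buck_hash_sys_bytes=%d", ms.BuckHashSys)
	}
}

func main() {
	go reportBuckHashSys()
	select {} // stand-in for the application's real work
}
```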
@nik-andreev thank you so much for sharing this. Is each line representing a different instance of your app? If so, it seems like one instance is using 160 MB of BuckHashSys and is still growing towards the end (blue line). That's a lot. I just had a look at our internal apps written in Go, and the highest value I found was one app peaking at 70 MB. But even that one was a huge outlier; most apps were hovering around 10 MB. And those are all pretty big code bases. Based on this, I'd say that the profiler memory usage you're seeing in v1.33.0 makes sense. The way we've implemented delta profiles requires hanging on to the most recent allocation profile for a bit, and in your case that can be ~160 MB * X, where X is the overhead caused by the fact that the profile dd-trace-go uses is symbolized (BuckHashSys is not). This is very unfortunate, but it's really just an amplification of the Go runtime's own BuckHashSys memory usage. Anyway, I can think of a few ways to improve the situation:
Let me know if you have any preferences on the above. Personally I'd favor option 3 and/or 1.
Yep, each line is an instance of the app. The first option is really appealing, as we can get back to the main version of the library and experiment easily. It would be great if you could add it in the next scheduled release. Also, I appreciate the offer in option 3! I might ping you at some point in the slack channel, though I'm not sure when I will get back to it.
@nik-andreev sounds good. I'll try to land a patch to add a toggle for delta profiles ASAP. And my offer for option 3 isn't going away. Ping me on slack, or even better send an e-mail to [email protected] and reference my name. This would be the best way to securely exchange/access the relevant data.
Delta profiles require a certain amount of heap memory that is proportional to MemStats.BuckHashSys (typically ~2x). For applications that make heavy use of code generation or recursion, the memory usage can be several hundred MB. Add WithDeltaProfiles(), which allows disabling delta profiles for applications where this is a deal breaker. Those applications will still see allocation profiles, but they won't be able to aggregate or compare them. The default value for WithDeltaProfiles() is true, since applications with high BuckHashSys seem to be rare. See https://dtdg.co/go-delta-profile-docs for more information about delta profiles. Mitigates #1025 and implements PROF-4305.
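In practice, opting out is a single option at profiler startup; a minimal sketch (service name is a placeholder):

```go
package main

import (
	"log"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Delta profiles are on by default; passing false trades the ability to
	// aggregate/compare allocation profiles for lower memory usage in apps
	// with a large MemStats.BuckHashSys.
	if err := profiler.Start(
		profiler.WithService("my-service"), // placeholder
		profiler.WithDeltaProfiles(false),
	); err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()

	// ... application code ...
}
```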
Closing this since #1038 mitigates this problem and will be included in the next release. As discussed, there are more improvements that could be made here to lower the memory usage of delta profiles, but right now it seems like it would only benefit a very small number of customers (using code generation and/or lots of recursion). That being said, I'm happy to reopen this if more people chime in here.
@natebrennand thanks, this feedback is really appreciated and helps to bump this higher on our priority list! One Q: In your case the memory overhead seems to be just 45 MiB? Is this prohibitive for your application? What would be an acceptable amount of memory usage for you? I realize your app has low memory usage in general (just 94 MiB total), so this is quite a lot relatively speaking, but overall it seems like you might not be hurting from using too much memory? PS: Sorry for the late reply, I was on vacation.
Hi @felixge, no worries about timing, and thank you for getting back to me. Fair questions; I believe this is mainly due to the time period in which I took that screenshot. Our primary application that we observed this issue on typically runs using 100-150 MB of memory, and we limit it at ~500 MB.
Thanks for sharing those details @natebrennand. Looking at the screenshot, it seems like the actual OOMs happen when there is a significant spike in memory usage, which is probably from spikes of traffic? Do you still see those spikes now? Do you have any rate/concurrency limits in place to avoid traffic spikes consuming too much memory? Anyway, 500 MB is a somewhat tight memory limit, so if you intend to keep it you might be better off without delta profiles for now. That being said, we definitely consider this to be a problem with the feature. I've just bumped it up on our internal roadmap, so hopefully we'll get a chance to iterate on it in the near future!
Hey @felixge 👋 Was there any further development related to this? I'm observing similar behavior in one of our applications that does not use recursion or a lot of code generation. Right now I'm seeing growing heap allocations related to deltas, and overall growing memory usage that is not allocated on the heap. It results in more frequent OOM kills. What would you advise in that case?
We implemented #1511 which should have addressed the issue that was raised here. Below is an example data point of the impact this had on one of our internal Go applications:
If you're still seeing issues, it'd be great if you could raise a new issue with more details. But it sounds like you might have a memory leak on your end (I'm basing this on your "overall growing memory usage that is not allocated on the heap" comment)?
@felixge Any ETA on when this fix will be present in a stable release? For the version that we're using in production, the changelog is massive: v1.67.1...1.999.0-beta.11. As for the issue, we're also seeing similar behavior; the resets in memory usage correspond to deployment rollouts.
@PrayagS you might be affected by golang/go#69590. TL;DR: set the environment variable GODEBUG=profstackdepth=32 to cap the profiler's stack depth at its previous limit.
@nsrip-dd Thanks for getting back. We're on Go 1.21.9 and I can see that the stack depth is still 32 in that version (https://go.googlesource.com/go/+/refs/tags/go1.21.9/src/runtime/mprof.go#43). |
Ah, indeed. To answer your question "Any ETA on when this fix will be present in a stable release?": we have been using more memory-efficient profile processing code by default since v1.47.0, implemented in #1511, which Felix linked above. That fix is for the processing done by this client library. Unfortunately there's not much we can do in this library to affect the amount of memory used by the Go runtime for profiling (what you see in MemStats.BuckHashSys). In the meantime, you could perhaps change the sampling rates for the profilers to collect data less often; one such knob is sketched below.
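The concrete example in the original comment was lost, so the knob shown here (runtime.MemProfileRate and its value) is an illustrative assumption rather than the author's exact suggestion:

```go
package main

import (
	"log"
	"runtime"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Default is 512 KiB: the heap profiler records roughly one sample per
	// 512 KiB allocated. Raising the rate means fewer samples, fewer distinct
	// stack-trace buckets, and less runtime profiling memory, at the cost of
	// coarser allocation profiles. The value below is illustrative.
	runtime.MemProfileRate = 4 * 1024 * 1024

	if err := profiler.Start(
		profiler.WithService("my-service"), // placeholder
	); err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()

	// ... application code ...
}
```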
Thanks for getting back. Appreciate the detailed response.
Noted. Are there any scenarios in your experience where the application code could be causing this behavior? Because IIRC we haven't seen this with Pyroscope's Go client. I'll try to test with it again to record more recent data.
Starting with version v1.33.0 there appears to be a memory leak in the profiler. I can't reproduce it yet, but I'm basing the observation on a series of profile snapshots taken from a prod application that ran overnight and was mostly idle in the last 30 minutes.
I saw that there was a large refactor in v1.33.0 (#842), but I'd prefer not to downgrade to v1.32.1 because we really need the other memory-related fix from #976.