Heap size not coming down after objects are removed #685
Comments
Which kind of hook are you using? Also, your collection and hooks are long-lived objects in memory, so it will never go down entirely. However, I can attest to the leak to some degree. If I have a long-lived Tile38 leader that runs for weeks without a restart and a couple of 100k Kafka hooks, at some point it will run out of memory. Only a restart will free up memory again. I can imagine that the issue becomes more apparent the more you're storing and the more hooks you have. I think we had multiple threads on Slack about that already, and some issues here on GitHub. While some culprits have been found, either on my side or with the underlying Kafka library sarama (remember prometheus leaking memory), it was never truly fixed. |
I am using Kafka hooks, and the hook expiry in my system is currently around 1 year. I am ready to change that if it helps with memory management, but I don't think it will. Even with around 100k hooks I am facing this issue. Is there any approach that can help solve this memory issue? |
Based on the steps above I cannot reproduce the same issue on my side.
The heap should be going up and down all the time. The GC command only forces an immediate garbage collection, but there will be automatic garbage collection happening continuously in the background.
What I tried
I opened a terminal and started a fresh tile38-server instance:
Then I opened up another terminal and polled for heap_size every 1 second.
# example output
3724152
3804120
3952664
4100920
4249160
3423384
3572968
...
You will see with an idle system the heap_size will continuously grow, but then suddenly shrink, then grow and shrink again. Then I opened a third terminal and issued the following commands:
Those commands have very little effect on the overall heap_size. And the system still seems stable. Now from a fourth terminal I insert about 100k random objects using the tile38-benchmark tool.
And the heap_size suddenly jumps up significantly, as expected:
3584064
3734600
31831976 # <-- benchmark started
43412232
92726424
75022664
82000280
Now I reissue the FLUSHDB, AOFSHRINK, GC and it goes down again.
96183288
96331576
96553400
96702344
3994632 # <-- GC
4146840
If it's related to the Kafka plugin or something else, then I would absolutely like to find a way to reproduce the issue and plug the leak. |
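The grow-then-shrink pattern described above is ordinary Go garbage collector behavior and can be seen outside Tile38. The following is a minimal sketch, not Tile38 code: `heapAlloc` and `measure` are hypothetical helpers, and it assumes that the `heap_size` stat tracks something close to `runtime.MemStats.HeapAlloc`.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAlloc returns the bytes of allocated heap objects reported by the Go runtime.
func heapAlloc() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// measure allocates n buffers of size bytes each, reads the heap while they
// are still referenced, then lets them become unreachable and forces a
// collection. It returns the heap reading before and after the collection.
func measure(n, size int) (live, collected uint64) {
	bufs := make([][]byte, n)
	for i := range bufs {
		bufs[i] = make([]byte, size)
	}
	live = heapAlloc()
	// bufs is guaranteed reachable up to this point and unreachable after it.
	runtime.KeepAlive(bufs)
	runtime.GC() // comparable in spirit to issuing a manual GC command
	collected = heapAlloc()
	return live, collected
}

func main() {
	live, collected := measure(64, 1<<20)
	fmt.Printf("live=%d collected=%d\n", live, collected)
}
```

The heap only drops when nothing references the buffers any more. If some long-lived structure still holds a pointer to them, a forced collection cannot reclaim them, which is the shape of the problem discussed in this thread.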
@tidwall Initially I thought this was happening with all objects, but after @iwpnd suggested that it could be something specific to hooks, I checked and it is.
As you can see, I have added 100k hooks
I did the same with just normal geofences added using SET, and after the flush commands the heap size came down. If this code snippet is not enough, I can create a small spring-boot server with the required code and push it to GitHub so you can check properly. |
tile38/internal/endpoint/endpoint.go Line 148 in a08c55b
If every hook has its own connection and each connection is an entry in that map then adding new hooks over the course of a Tile38 leader lifetime will leak memory eventually because it will never be freed up - or I don't see where. Upon delete of the hook the item is removed from the collection, but the connection remains in that map. Does it not? |
Hooks do not manage connections. That's the responsibility of the Manager. From the hook's perspective an endpoint is just a URL string (or multiple strings if the hook uses failover). When a hook must send an event, that event, plus the endpoint string, is sent to the Manager.
tile38/internal/server/hooks.go Lines 696 to 707 in a08c55b
The Manager will then check if there's a usable connection already mapped to that endpoint string. That's the map you are talking about. If the connection is usable then the event is sent, otherwise a new connection is opened and assigned to the endpoint string, and then the event is sent. Those connections open on demand, and are automatically closed when idle after a number of seconds, usually 30.
tile38/internal/endpoint/kafka.go Line 19 in a08c55b
There will only ever be one connection per unique endpoint. So if you have 100k hooks using the same exact endpoint URL string, then all events for all hooks will pipeline through that connection. Which is the case in @Mukund2900's example above. |
I pushed a change to the master branch that specifically addresses the issue from @Mukund2900's example above. It appears that the FLUSHDB did not clean up all the memory referenced by a hook. Now it does. |
@tidwall I hope this will also get triggered when specific hooks expire or when I run GC or AOFSHRINK, because in a real-world scenario that is how I expect it to work, i.e. free up memory on hook expiry. I only used FLUSHDB to showcase the issue. Thanks a lot for the quick support and response. |
Ah gotcha!! It's unique connections, not duplicate per hook. |
@tidwall with the changes you have made I still see memory not releasing completely. See the cmds ->
After the hooks expire ->
Running GC & AOFSHRINK ->
Some memory is released, but about 70% of it is not.
As you can see, the response is the same: memory is not released completely after all this. |
Uploading the memory profiling data in case that helps. I see the problem now. I think this happens because I am using WHEREEVAL conditions with the hooks. I ran the same test again, saving the hooks without any WHERE/WHEREEVAL conditions this time.
While building the geofences I was adding one of these commands, which is the root cause
|
Adding the following command while flushing all data does the job, but this helps only when we FLUSHDB. There needs to be a trigger that deletes the script when the hook corresponding to it is removed. |
What's happening is a tile38/internal/server/token.go Line 429 in 5642fc4
Then the next time a new WHEREEVAL is encountered with the same script/sha1, the existing Lua script is used instead of having to compile a duplicate. This is great for performance when there are many search queries with the same WHEREEVAL, but those scripts are not removed until This leaves all those scripts unused in memory. The WHEREEVAL is the same as calling the
The easy solution is to just not cache for WHEREEVAL by removing the line above.
I think a more robust solution is to cache the WHEREEVAL scripts in their own LRU cache, instead of sharing them with the
I just pushed a change that does that. |
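For readers unfamiliar with the approach, a bounded LRU cache keyed by script SHA1 can be sketched as follows. This is an illustration of the general technique only, not the change that was pushed: the `scriptLRU` type, its capacity, and the use of the raw source string as a stand-in for a compiled Lua script are all assumptions.

```go
package main

import (
	"container/list"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// entry pairs a script's SHA1 with a stand-in for its compiled form.
type entry struct {
	sha    string
	script string
}

// scriptLRU keeps at most capacity scripts, evicting the least recently used.
type scriptLRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // sha1 -> element in order
}

func newScriptLRU(capacity int) *scriptLRU {
	return &scriptLRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// shaOf returns the hex-encoded SHA1 of a script source.
func shaOf(src string) string {
	sum := sha1.Sum([]byte(src))
	return hex.EncodeToString(sum[:])
}

// get returns the cached script for src and whether it was a cache hit.
// On a miss the script is "compiled" (here: stored as-is) and cached, and
// the least recently used entry is evicted if the cache is over capacity.
func (c *scriptLRU) get(src string) (string, bool) {
	key := shaOf(src)
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).script, true
	}
	c.items[key] = c.order.PushFront(&entry{sha: key, script: src})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).sha)
	}
	return src, false
}

// size reports how many scripts are currently cached.
func (c *scriptLRU) size() int { return c.order.Len() }

func main() {
	cache := newScriptLRU(2)
	for _, src := range []string{"return 1", "return 2", "return 1", "return 3"} {
		_, hit := cache.get(src)
		fmt.Printf("%q hit=%v size=%d\n", src, hit, cache.size())
	}
}
```

The point of the bound is that scripts belonging to hooks that no longer exist eventually fall out of the cache on their own, instead of staying in memory until an explicit flush.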
@tidwall can you please release a version with these changes? |
@iwpnd Most important is reproducing it on my side. If the graphs you provided are based on some mocked-up test code, perhaps I can use that to diagnose the issue? |
Are you referring to the changes in the master branch that I pushed yesterday? |
@tidwall yes |
I'm sorry that was really not helpful. I was dumping information and was interrupted providing additional context.
Yes I do. it is the only difference between the two microservices.
This is production data with 100k Kafka hooks, and 280k iots that are SET approx. every 10 seconds.
this is on 1.30.2 stable
Unfortunately it's production code I cannot share 1 to 1, but
That's the setup we're running in a nutshell. |
No worries. The extra context is helpful. I'll poke around and see what happens. |
@tidwall As the system scales over time, I am facing the issue referenced above, where memory still does not come down after the hooks expire.
This is just 2 minutes after running AOFSHRINK and GC. And now when I run them again, here is the data ->
|
Describe the bug
I am running Tile38 in production. For my use cases I have 1 million hooks and geofences defined. Everything works as expected except that I am getting memory issues.
The heap size never comes down. Even after running a manual GC, the heap is not released.
To Reproduce
Steps to reproduce the behavior:
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":2,"heap_released":204906496,"heap_size":2152414904,"http_transport":true,"id":"d8d9d1384aecd4f2a9366f3a26a4ad71","in_memory_size":0,"max_heap_size":2147483648,"mem_alloc":2152414904,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":22895,"pointer_size":8,"read_only":false,"threads":8,"version":"1.30.2"},"elapsed":"302.034µs"}
Everything comes down to 0 except heap_size, which does not come down. It only comes down again after a restart.
After restart ->
{"ok":true,"stats":{"aof_size":0,"avg_item_size":0,"cpus":10,"heap_released":1253376,"heap_size":4058800,"http_transport":true,"id":"1bea22bb0e8946896b1ff9f0024a9133","in_memory_size":0,"max_heap_size":0,"mem_alloc":4058800,"num_collections":0,"num_hooks":0,"num_objects":0,"num_points":0,"num_strings":0,"pending_events":0,"pid":37223,"pointer_size":8,"read_only":false,"threads":16,"version":"1.30.2"},"elapsed":"682.042µs"}
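The two responses above can be compared programmatically. Below is a minimal sketch that decodes a trimmed copy of each response and flags the case where the server holds no hooks or objects but the heap is still large. The `serverReply` struct, the `looksLeaked` helper, and the 64 MiB threshold are hypothetical choices for illustration; only the field names and values come from the output above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverReply models only the fields of the stats response used here.
type serverReply struct {
	OK    bool `json:"ok"`
	Stats struct {
		HeapSize     uint64 `json:"heap_size"`
		HeapReleased uint64 `json:"heap_released"`
		MaxHeapSize  uint64 `json:"max_heap_size"`
		NumHooks     int    `json:"num_hooks"`
		NumObjects   int    `json:"num_objects"`
	} `json:"stats"`
}

// parseStats decodes a raw JSON stats response.
func parseStats(raw string) (serverReply, error) {
	var r serverReply
	err := json.Unmarshal([]byte(raw), &r)
	return r, err
}

// looksLeaked reports whether the server is empty yet still holds more heap
// than threshold bytes. The threshold is an arbitrary, assumed cut-off.
func looksLeaked(r serverReply, threshold uint64) bool {
	return r.Stats.NumHooks == 0 && r.Stats.NumObjects == 0 && r.Stats.HeapSize > threshold
}

// Trimmed copies of the two responses shown above.
const beforeRestart = `{"ok":true,"stats":{"heap_released":204906496,"heap_size":2152414904,"max_heap_size":2147483648,"num_hooks":0,"num_objects":0}}`
const afterRestart = `{"ok":true,"stats":{"heap_released":1253376,"heap_size":4058800,"max_heap_size":0,"num_hooks":0,"num_objects":0}}`

func main() {
	const threshold = 64 << 20 // 64 MiB, assumed for illustration
	for _, raw := range []string{beforeRestart, afterRestart} {
		r, err := parseStats(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("heap_size=%d suspicious=%v\n", r.Stats.HeapSize, looksLeaked(r, threshold))
	}
}
```

Note that in the first response heap_size (2152414904) is already above max_heap_size (2147483648) even though every counter is zero.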
Expected behavior
Ideally it should come down on its own, or be released after running GC. But it does not.
Operating System (please complete the following information):
How can I fix this? Am I missing anything?