Memory leak in proxy? #388
We're running z2jh chart version …
We start seeing serious performance problems at about 1.5GB, which is suspiciously close to the heap limit for node 🤔 So maybe it's a memory leak that then cascade-fails at the heap limit into some sort of garbage collection nightmare? Or?
Do you happen to know if the memory increases are correlated with particular events, e.g. a user starting a new server, or connecting to a particular service?
No, but I'm looking into it. My vague suspicion: websockets? We push them pretty hard, e.g. many users are streaming VNC over websocket. Is there a log mode that has useful stats about e.g. the routing table?
OK, a further development: since high RAM usage correlated with performance problems, I added a k8s memory limit to the pod, thinking it would get killed when it passed 1.4GB of RAM and reboot fresh, a decent-ish workaround for now. Note that there's one other unusual thing here, I … Did something change, or did adding a k8s memory limit suddenly change the behavior?
(Note this otherwise consistent memory growth pattern goes back to January, across a number of z2jh chart version upgrades since..... this is.... weird.)
Hmmm, so when the pod restarts, is it because it has been evicted from a node, or because its process restarted within the container? Being evicted from a node can happen based on external logic, while memory management within the container follows more internal logic, which can be enabled by limits that clarify it must not surpass certain thresholds. I need to learn more about OOM-killer behavior within the container vs. by the kubelet etc., but perhaps you ended up helping it avoid getting evicted for surpassing its memory limit. Hmmm..
@snickell was what you observed related to load at all? Like, do you observe this behavior on weekend days too? We're currently experiencing relatively high load on our deployment, and I observe something similar: memory consumption in the proxy will just suddenly shoot up and it becomes non-responsive. Are you still using CHP for your proxy? I am considering swapping it for Traefik in the coming days here.
@snickell have you experienced this with older versions of z2jh -> chp as well?
Still happening on the latest version (v4.5.6). |
see also #434. i believe the socket leak is the root cause of the memory leak. on our larger, more active hubs we've seen constant spiking of the … "load" is ~300+ users logging in around the "same time" ("same time" being anywhere from 15m to a couple of hours). i don't believe that increasing the …
We finally replaced chp with traefik in our z2jh deployment, and this problem was clearly fixed. 😬 Check out that alternative in case you are experiencing this.
thanks, good to know. we've been considering this as well.
@marcelofernandez are you able to share config for your setup?
echoing @consideRatio -- do you have any relevant traefik config bits you could share? this would be super useful! :) thanks in advance... |
Hey guys, sure! First and foremost, I'm sorry I can't give you all the details of my company's internal PR because:
That said, I can give you an overview of what I did. The complicated part was that it seemed like nobody had done this before, so I based my work on this (far more ambitious) previously rejected PR, which was originally aimed at replacing both proxies:
The only thing I did (because I only wanted stability) based on that PR was to:
Based on Z2JH's architecture diagram, here are the changes. Once I had defined what I wanted, I had to drop unneeded code from the PR above and configure the hub to call the proxy in the same pod (…). I implemented this about a year and a half ago; if you have more questions, just let me know... Regards
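For readers asking about config: the internal PR above isn't public, but a minimal sketch of the same idea, assuming the jupyterhub-traefik-proxy package with its file-based provider, might look like the following. Option and entry-point names vary between package versions, so treat these as illustrative rather than a drop-in config.

```python
# jupyterhub_config.py -- illustrative sketch only, not the internal PR described above.
# Assumes the jupyterhub-traefik-proxy package is installed alongside JupyterHub.
c = get_config()  # noqa: provided by JupyterHub when it loads this file

# Replace configurable-http-proxy with traefik's file-based provider
# (the entry-point name may differ between jupyterhub-traefik-proxy versions).
c.JupyterHub.proxy_class = "traefik_file"

# Let the hub start and manage the traefik process itself, since in this setup
# the proxy runs alongside the hub rather than as a separate deployment.
c.TraefikFileProviderProxy.should_start = True

# Credentials the hub uses to talk to traefik's API (placeholder values).
c.TraefikProxy.traefik_api_username = "api_admin"
c.TraefikProxy.traefik_api_password = "change-me"
```

Running the proxy in the same pod as the hub (as described above) is what makes `should_start = True` the natural choice here; a standalone traefik deployment would instead set it to False and point the hub at the external instance.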
4.6.2 was released 2 months ago with a fix for the leaking sockets. Is there still a memory leak or can we close this issue? |
@manics i don't think we should close this yet... we still saw … i'm sure that within a few weeks we'll see OOMs/socket leaks once the fall term ramps up.
If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks), I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.
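For anyone who wants to try reproducing this, a rough, hypothetical sketch of a websocket-heavy local load test (not the apache-bench run described above) could look like this. It assumes CHP is listening locally on port 8000 and proxying to a backend that accepts websockets at /ws; the point is to mix short-lived request churn with many long-lived websockets that are dropped without a clean close, since that is exactly what a plain HTTP benchmark never exercises.

```python
# Hypothetical local stress sketch using aiohttp; adjust PROXY and the /ws path
# to match whatever backend sits behind the locally running CHP.
import asyncio
import aiohttp

PROXY = "http://127.0.0.1:8000"

async def http_churn(session, n):
    # Many short-lived requests: baseline load that a naive benchmark already covers.
    for _ in range(n):
        async with session.get(PROXY + "/") as resp:
            await resp.read()

async def ws_hold(session, seconds):
    # Long-lived websocket held open, then dropped without an explicit close,
    # to mimic browsers disappearing mid-session.
    ws = await session.ws_connect(PROXY + "/ws")
    await ws.send_str("ping")
    await asyncio.sleep(seconds)
    # intentionally no ws.close(); the abrupt drop at teardown is the interesting case

async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[http_churn(session, 1000) for _ in range(20)],
            *[ws_hold(session, 30) for _ in range(200)],
        )

if __name__ == "__main__":
    asyncio.run(main())
```

While this runs, watching the proxy's RSS and its count of established sockets (e.g. with netstat, as done later in this thread) would show whether the websocket path leaks where the plain HTTP path does not.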
re benchmarking... we really don't have the cycles, available staff, or deep understanding of how the proxy works to do this.

re the "something": we're seeing mildly improved performance w/4.6.2 but are still experiencing pretty regular, albeit much shorter (and self-recovering), outages at "peak"[1] usage. [1] peak can be anywhere from ~200 up to ~800 users on a hub.

for example, last night between 8:45p and 9p we had ~188 students logged on to datahub (the lowest end of 'peak') and saw the proxy peg at 100% CPU and 1.16G ram. hub cpu hovered around ~40% until the outage, and during that 15m dropped to nearly 0%. hub memory usage was steady at around ~477M. only during the 15m of the outage (~8:45p - 9p) did our …

not surprisingly, the readiness probes couldn't find either the hub or proxy during this outage (and also the hub just past when things recovered?).

i'll dig more through the logs and see what needles i can winnow out of the haystacks.
running …
this repeated regularly for about 30m, and then for another 30m lots of messages like this:
during all of this, the … so: something is still amiss.
We have discussed two kinds of memory: network tcp memory, and normal ram memory. Getting memory-killed is a consequence of surpassing requested memory via k8s. This is normal memory killing I think, so what memory request is configured for the proxy pod?

Note that I think the graph you have in grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: jupyterhub/grafana-dashboards#128

Is normal memory still growing without bound over time as users come and go, with chp getting memory-killed for that reason, making requesting more memory just a matter of gaining time before a crash?
nope -- this is only one deployment, not a sum of them all.
it seems to grow over time, and as it surpasses the …
fwiw i'm about to bump this to 3Gi "just to see".
this just happened again and it really looks like we're running out of ephemeral ports...
we're actually thinking about putting our …
currently on the impacted proxy node:
hmm, well, this sheds some light on things: #465. some of our hubs have > 1000 users.
Noticed that when this behavior happens at berkeley, we see accompanying logs indicating EADDRNOTAVAIL and tons of connection failures to the hub pod. Really like the issue described here: … Running netstat -natp on the chp proxy indicates it's very possible we have enough connections to be running out of ephemeral ports. That would likely explain the sudden increase in cpu as well, because I think the proxy rapidly retries when this behavior starts.
for those continuing to be held in rapture by this enthralling story, i have some relatively useful updates!
point 4 is important... and makes me wonder if something in lab or notebook has buggy socket handling code.
A big difference between hubs that use one of the mentioned proxies and others that don't is that instead of a majority of the connections going from chp -> hub:8081, the connections instead go to the user pods directly. Since the user pods are spread out across a subnet, the connections aren't all focused on a single ip:port destination, and so there is no issue with ephemeral port exhaustion.
yep, thanks for clarifying @felder!
About where connections go etc., I expect:
If you could figure out something more about where things go that amounts to such huge numbers of requests, that may allow us to tune lab/hub etc.; for example, how often the user server reports activity to the hub can be reduced, I think. I also think lab checks connectivity with the hub regularly.
@consideRatio Do you have ideas for determining more information regarding these connections? Currently we're just looking at reporting from netstat and seeing the ip:port pairs for source and destination on the connections. Here's sample output for what we're seeing with netstat on the chp pod, which unfortunately is not particularly helpful with regard to getting more specifics.
In this case 10.28.7.157 is the chp pod ip (I replaced the actual hub ip with "hubip"), and we can see a few connections going to user pods on port 8888. What I don't know at this time is why the connections were opened in the first place, or what user/process they are associated with. We're just getting started on this line of investigation (we only discovered the ephemeral port exhaustion late last week), and while I would love to give you the information you're requesting, I don't know how to obtain it. Also, out of curiosity, do you see a similar ratio on your deployments with regard to ephemeral port usage? We're seeing this issue primarily on hubs with > 200 users, which would suggest roughly ~100 connections per user from chp -> hub:8081; but again, I have no way at this time of associating the ephemeral ports with anything meaningful, so for all I know it could be activity from a subset of users or processes.
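One hypothetical way (not something posted in this thread) to turn that raw netstat output into per-destination counts, so the chp -> hub:8081 share of the ephemeral port range becomes a single number per peer rather than thousands of raw lines:

```python
# Hypothetical helper: count established TCP connections per peer address, to show
# how many ephemeral ports are consumed by chp -> hub:8081 versus direct connections
# to user pods. Assumes `netstat` is available in the chp container (the thread
# already uses `netstat -natp` there).
import subprocess
from collections import Counter

def established_by_peer():
    out = subprocess.run(["netstat", "-nat"], capture_output=True, text=True).stdout
    counts = Counter()
    for line in out.splitlines():
        fields = line.split()
        # netstat -nat rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "ESTABLISHED":
            counts[fields[4]] += 1   # tally by foreign (destination) address:port
    return counts

if __name__ == "__main__":
    for peer, n in established_by_peer().most_common(10):
        print(f"{n:6d}  {peer}")
```

This still won't say *why* a connection was opened or which user it belongs to, but it makes the ~100-connections-per-user ratio above easy to track over time and across hubs.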
@shaneknapp What image(s) do you use?
@shaneknapp If ephemeral port exhaustion is not caused by …

@snickell Could you test if there is still a memory leak with v4.6.2? IMHO these are separate issues.
all of our images are built from scratch... some are dockerfile-based, most are pure repo2docker, and everything is built w/r2d. of the three hubs most impacted by this, one has a complex Dockerfile-based build, and the other two are very straightforward python repo2docker builds.
well, until literally late last thursday we had no idea what was going on, and we still don't know what is causing the port spam yet. if you have any suggestions i'd be happy to move the conversation there. perhaps #434 ?
No. |
until we figure out a better home for this issue, i will continue to update our findings here. :) anyways...
deployed via … this gives us 55000 (65000 - 10000) ephemeral ports.
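The deployment mechanism is elided above, but the effect being described is a widened net.ipv4.ip_local_port_range. A quick, hypothetical way to confirm the new range actually took effect inside the chp pod's network namespace:

```python
# Hypothetical sanity check, assuming a Linux container: read the effective
# ephemeral port range and report how many ports that leaves available.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

print(f"ephemeral port range: {low}-{high} ({high - low} ports available)")
```

With the range described above this would print 10000-65000, i.e. the 55000 usable ephemeral ports mentioned.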
I figure we have the following kinds of issues; I opened #557 to represent the third kind - let's switch to that!
Aloha, we've been seeing a pattern of growing daily memory usage (followed by increasing sluggishness, then non-responsiveness above around 1-2GB of RAM) in the 'proxy' pod:
The different colors are fresh proxy reboots, which have been required to keep the cluster running.
-Seth