Socket leak #434
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗 |
This sounds similar to #388. Would you mind adding your comment on that issue and closing this one to avoid duplicates? Thanks! |
@manics that looks like a memory leak, while this is a socket leak. They should be different, no? |
We just saw a node with a lot of proxy pods spew a lot of |
ws release is unlikely to be relevant, since it's only used in tests. |
I think the relevant code is configurable-http-proxy/lib/configproxy.js, lines 233 to 245 (at cb03f77). |
I think this remains a problem in 4.6.1 using node 20 on GKE with linux kernel 5.15, based on info from @shaneknapp. |
CHP has multiple node HTTP servers working in parallel: one for its own REST API, one for proxying, one for metrics. It would be good to determine whether the growing tcp memory / sockets etc. are associated with a specific one of these. |
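For anyone digging into this, one way to attribute the growth could be to poll each server's connection count with Node's `server.getConnections()`. A minimal sketch only, not CHP's actual code; the server names and ports here are made up:

```js
const http = require("http");

// stand-ins for the three servers; names and ports are placeholders for the sketch
const servers = {
  api: http.createServer((req, res) => res.end("api")).listen(8001),
  proxy: http.createServer((req, res) => res.end("proxy")).listen(8000),
  metrics: http.createServer((req, res) => res.end("metrics")).listen(8002),
};

// log each server's open connection count once a minute
setInterval(() => {
  for (const [name, server] of Object.entries(servers)) {
    server.getConnections((err, count) => {
      if (!err) console.log(`${name}: ${count} open connections`);
    });
  }
}, 60 * 1000);
```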
Looked at one CHP process and saw for example...
This wasn't expected to have anything close to 500 open connections or similar, so I think it's very safe to say that this reproduces. This is from the latest chp running with node 20 on linux kernel 5.15 nodes. |
I'm not sure when I expect a socket to be closed. When it times out based on a "timeout" option? I think the timeout option may be infinite. Is the issue that there is simply nothing that makes us destroy sockets once created, because we default to an infinite timeout? |
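For reference, a minimal sketch of what the Node-level defaults look like and how an idle-socket timeout could be applied. This is not what CHP currently does, and the 24h value is only the number floated in the next comment:

```js
const http = require("http");
const server = http.createServer((req, res) => res.end("ok"));

// server.timeout is the idle socket timeout in milliseconds;
// 0 means sockets are never timed out by the server (the default in recent Node versions)
console.log(server.timeout, server.keepAliveTimeout);

// one possible mitigation: time out idle sockets after e.g. 24 hours,
// and make sure a timed-out socket is actually destroyed
server.setTimeout(24 * 60 * 60 * 1000, (socket) => socket.destroy());

server.listen(8000);
```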
@minrk and others, is it safe to change the default for a timeout value here to something quite extreme, like 24 hours? I don't want us to disrupt users that are semi-active and run into issues at the 24th hour - but they wouldn't as long as they are semi-active, right? We have two timeout args matching the node-http-proxy options at https://github.com/http-party/node-http-proxy?tab=readme-ov-file#options. |
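For clarity, those two options map onto node-http-proxy roughly like this (a sketch only; the target is a placeholder and the 24h values just echo the number discussed above):

```js
const http = require("http");
const httpProxy = require("http-proxy");

// both values are in milliseconds
const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888",   // placeholder target
  timeout: 24 * 60 * 60 * 1000,      // socket timeout for the incoming request
  proxyTimeout: 24 * 60 * 60 * 1000, // timeout for the outgoing request to the target
});

http.createServer((req, res) => proxy.web(req, res)).listen(8000);
```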
There is also a related issue reported in node-http-proxy - http-party/node-http-proxy#1510. node-http-proxy was forked and had that issue fixed with a one-line commit at Jimbly/http-proxy-node16@56283e3.
Looking closer at that fork, there is also another memory leak fixed in another commit according to the commit message: Jimbly/http-proxy-node16@ba0c414. This is detailed in a PR as well: http-party/node-http-proxy#1559. Those two memory-fix commits are the only actual fixes in the fork; the rest is just docs etc.
Maybe we should do a build of chp based on the forked node-http-proxy project and push a tag that users can opt into? Like 4.6.1-fork? |
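If that one-line fix is along the lines of destroying the socket rather than only ending it on error (an assumption on my part, since the exact change isn't quoted here), it would look roughly like this. Illustrative only, not the actual diff from the linked commits:

```js
const httpProxy = require("http-proxy");

const proxy = httpProxy.createProxyServer({ target: "ws://127.0.0.1:8888", ws: true });

// for upgraded (websocket) requests, node-http-proxy's error event hands back
// the raw socket; destroying it releases the file descriptor, whereas only
// ending it can leave the connection half-open if the peer never responds
proxy.on("error", (err, req, socketOrRes) => {
  if (socketOrRes && typeof socketOrRes.destroy === "function" && !socketOrRes.destroyed) {
    socketOrRes.destroy();
  }
});
```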
I pushed a 4.6.1-fork image tag. If someone wants to try whether this fork helps, just reference 4.6.1-fork as the image tag instead. |
Nice research! I don't think we should have a separate -fork image. If we think the fork is production-ready then we should either switch CHP to use it if it's fully trusted, or vendor it if it's not (as previously discussed in #413 (comment)). |
i'll be testing out the fork. edit: here are my changes in case anyone wants to see them! |
If it works, 👍 to just switching the dependency and publishing a new release, without any -fork tag. I think the sustainable longer-term solution is to vendor http2-proxy, which I started here but haven't had time to finish. It would be great to be able to have some actual tests to exercise these things, since it's been hard to control or verify. |
ok, i identified the most problematic hub... the chp pod has been getting OOMKilled and stack tracing at least every 2-3 days. i just deployed the test fork of the chp pod to it and will keep an eye on things over the rest of the week. in other news, this fix seems (read: seems) to use less tcp memory than before. it's tough to say for certain, but at the very least w/my latest deployment on the problematic hub i'll have something that's mildly on fire to watch, vs the others i've deployed on much less trafficked hubs. 🤞 |
womp womp. that pod has restarted three times in the past hour after deploying the -fork chp:
i'm also seeing a lot of these
and sometimes after it's killed (but not every time) we get the following in dmesg:
are we DOSing ourselves somehow? |
i killed that pod and the so. confusing. |
nope. that pod's chp is still getting OOMKilled. time for a break. :) |
What is its normal memory use, and what is its k8s memory request/limit? |
i've reverted that node is still regularly hitting the max memory, but just not as quickly(?) as with
i looked at chp pod restarts across our hubs. we're seeing this intermittently across our biggest deployments (both 4.6.1 and 4.6.1-fork... which all run on the same core node). the smaller-usage hubs' chps usually run between ~200-300MiB (~200 users max at any time, no massive spikes in logins etc). the larger hubs run ~400+ MiB but, depending on user count, the chp pods eventually run out of memory and are OOMKilled. today was spent looking into our largest class, but the other two had definitely been experiencing the same issue with slightly lesser impact. now that i know what to look for, i'll keep an eye on these individual pods' memory usage and know better what to look for in the logs. the biggest takeaway so far is that |
i also suspect the disclaimer: i have a few tabs open and suspect that the former might be that something isn't cleaning up when something in the proxy stack goes away. the latter might just be a symptom of this, but i can neither confirm nor deny. https://en.wikipedia.org/wiki/Martian_packet so martians are involved somehow, causing network flooding and a DOS? |
@consideRatio i've rolled back the 11 hubs that had after rolling back to le sigh. :( |
@shaneknapp is this fork vs non-fork only, or fork+timeout flags vs non-fork? |
fork + timeouts. forgot about the timeouts, actually! maybe it's that!
fwiw it's been a hell of a week lol.
|
btw i rolled the fork back out to a couple of smaller hubs, minus the timeout settings. everything seems cromulent, but the only way to really test this is to have a lot (200+) of people logging in within a short period of time. |
quick update here: this fix really does look promising. orphaned sockets seem to drop significantly, and memory usage doesn't explode wildly and cause users to receive 500s. |
huzzah! Thank you so much for testing and reporting, @shaneknapp. @consideRatio do you want to switch the dependency to the fork and make a real release of CHP? |
@minrk -- while i'm confident it helps, i'm also confident that it doesn't fix the problem outright. while we're not getting the spiky and constant OOMKills w/this test fork, there is still a pretty significant memory leak somewhere: i checked our javascript heap as well (this is in Mi):
as you can see we're bouncing against that pretty quickly (we had a 3-day weekend this week so the figures are a little smaller than usual): so there is still a significant memory leak. maybe we've exposed another bug during the testing... we're also seeing many 503s for ECONNREFUSED on our two biggest hubs. these pop up after the chp has been at the heap limit for a couple of hours, and it looks like people's proxy from the core node (w/the hub and chp pods) is disappearing. this is SUPER disruptive and is impacting coursework.
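For anyone wanting to pull the same numbers, a minimal sketch of reading the V8 heap statistics from a Node process; how you run it against the CHP pod is up to you, and CHP doesn't expose this itself as far as shown here:

```js
const v8 = require("v8");

const stats = v8.getHeapStatistics();
const toMiB = (bytes) => Math.round(bytes / 1024 / 1024);

console.log(`used heap:  ${toMiB(stats.used_heap_size)} MiB`);
console.log(`total heap: ${toMiB(stats.total_heap_size)} MiB`);
console.log(`heap limit: ${toMiB(stats.heap_size_limit)} MiB`); // the ceiling hit before OOM
```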
another quick update: the 503 errors we're getting are appearing on high traffic hubs running both the vanilla and fork versions of the chp. they're appearing in multiples of 30 (30 or 60). i think this behavior might be related to, but not caused by the chp. |
i just deployed the fork to prod for all of our hubs -- the fork seems to be holding up quite well on the high-traffic hubs (>1k users/day, high compute loads), so now we're rolling it out for the rest. if this continues to squelch the memory spikes/OOMKills for another week i'd feel comfortable giving my thumbs-up to roll the fork in to a new release! |
alright, it's been a week and i feel very comfortable in saying that we should definitely roll these changes into a release branch asap. it doesn't fix the problem outright (our highest-traffic/load hubs still have one or two chp OOMKills per day w/250+ concurrent users), but it's a significant improvement over vanilla 4.6.1! i firmly believe that we should still investigate further, and even after deploying 4.6.1-fork, we had another OOMKill/chp outage on march 5th that impacted ~300 users. yesterday, i sat down w/GCP and core node kernel logs, plus grafana/prom data, and put together a minute-by-minute timeline of how things went down. since i'll be on a plane for a few hours today, i'm hoping to get this transcribed from paper to an update on this github issue to help w/debugging. TL;DR: |
quick ping here... is the |
@shaneknapp thanks for the ping. I opened #539, then we can make a release with it. |
@shaneknapp Since it improves but doesn't fully fix the issue, it might be worth also testing an older CHP image (or building your own) based on NodeJS <=15.4
4.6.2 was released 2 months ago with a fix for the leaking sockets. Can we close this issue? |
Bug description
We are running z2jh (https://z2jh.jupyter.org/en/stable/) and found a socket leak in the proxy pod. The number of sockets is constantly increasing (over 60k), and after about a week the kernel reports an error: kernel: TCP: out of memory -- consider tuning tcp_mem. I have checked the number of open sockets using lsof.
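For reference, a rough equivalent of that lsof check as a small Node script (Linux only, and just a sketch; lsof itself is the simpler tool):

```js
const fs = require("fs");

// count fds in /proc/<pid>/fd that point at sockets (links look like "socket:[12345]")
function countSockets(pid) {
  const fdDir = `/proc/${pid}/fd`;
  return fs.readdirSync(fdDir).filter((fd) => {
    try {
      return fs.readlinkSync(`${fdDir}/${fd}`).startsWith("socket:");
    } catch {
      return false; // fd disappeared while we were looking
    }
  }).length;
}

console.log(countSockets(process.argv[2] || process.pid));
```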
Your personal set up
This chart uses the proxy docker image from jupyterhub/configurable-http-proxy:4.5.0
The config.yaml related to the proxy: