Hundreds of users lead to running out of tens of thousands of ephemeral ports #557
Comments
Based on #388 (comment), I think this may not be an issue with CHP as much as the software running in the user servers leading to a flood of connections being initiated via the UI.
@felder this is a followup to #388 (comment). I inspected two active deployments, with 222 and 146 currently active users respectively, on hubs where users access either /tree or /lab. From inspection, it seems this makes use of jupyter_server 2.12.1 and jupyterlab 4.0.9. This is from the CHP pod for the hub that currently has 222 user pods running the image
This doesn't rule out CHP. To do that you'd need to compare this with another proxy like Traefik. For example, if CHP isn't closing connections as fast as the browser, this could lead to too many ports in use. Do the existing CHP tests cover HTTP persistent connections?
One thing I'm noticing as I investigate is that user servers that use lab (as opposed to rsession-proxy or the like) interact with the hub pod a lot more often. Anytime I interact with the file browser, launcher, etc., last_activity for the hub pod route in chp updates. This is not the case if /rstudio is designated as the default URL. Additionally, the ESTABLISHED connection count to hubip:8081 with a single user pod running lab (as opposed to rstudio) increments pretty steadily as I do things like kill the pod, kill the kernel, refresh the browser, etc.
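One way to watch that count from inside the CHP container is something like the sketch below; it assumes a z2jh-style deployment and that ss is available in the image (the official CHP image is minimal, so you may need kubectl debug or a node shell instead).

```bash
# Count ESTABLISHED TCP connections from CHP to the hub API port (8081).
# Run inside the CHP container, e.g. via `kubectl exec` into the proxy pod.
# Assumes `ss` (iproute2) is present; adjust if the image doesn't ship it.
ss -tn state established '( dport = :8081 )' | tail -n +2 | wc -l
```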
I believe this might be what's happening... if a user closes their laptop, or opens their notebook in a new browser (which happens more often than you'd imagine), we see a lot of spam (hundreds of 503s being reported) in the proxy logs:
Hmmm, so we have a spam of 503 responses. After that, jupyterhub asks CHP to delete the route. After that, I expect the thing that got 503 won't get 503 responses any more, because the proxy pod won't try to proxy to the route; instead it will do something else, maybe redirect to the hub pod as a default route, which then gets spammed. @shaneknapp I guess we could see such redirects with debug logging, or can we already see redirect responses from CHP and we just aren't noticing them?
From the logs I see one failed request every ~10ms, five times in a row, which I guess means there's no delay between re-attempts.
@minrk I recall that you submitted a PR somewhere, sometime a while back, about excessive connections or retries. Was that for this endpoint?
So when running lab, if I do things like kill my pod or start up another connection from another tab or browser, I can usually get chp to emit 503 messages similar to:
This does make sense when I'm killing my user pod, since the server is no longer there at that IP. However, when this happens I see a correlated increase in the number of established connections from chp->hub:8081. Those connections seem to persist.
Noting that if I delete the route to the hub pod in chp, the connections still persist.
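For reference, deleting (or just listing) routes by hand goes through CHP's REST API; the sketch below assumes z2jh conventions (a proxy-api service on port 8001 and the CONFIGPROXY_AUTH_TOKEN environment variable), so adjust the names for your deployment.

```bash
# List the current routes and their targets.
curl -H "Authorization: token $CONFIGPROXY_AUTH_TOKEN" \
  http://proxy-api:8001/api/routes

# Delete the default route (the one pointing at the hub pod).
curl -X DELETE \
  -H "Authorization: token $CONFIGPROXY_AUTH_TOKEN" \
  http://proxy-api:8001/api/routes/
```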
What versions of jupyterlab and jupyter server / notebook are used?
The original issue was jupyterlab/jupyterlab#3929. I think JupyterLab is supposed to stop checking the API when it realizes the server is gone. But 503 means JupyterHub thinks the server is there when it's not, and 503s should be retried after a delay (I wouldn't be surprised if jupyterlab still retries a bit too fast). When the server is actually stopped (i.e. the Hub notices via poll and updates the proxy to remove the route), these 503s should become 424s, at which point JupyterLab may slow down/pause requests as it should. I don't expect the singleuser server is making "hairpin" connections to the Hub via CHP, unless something is misconfigured. I have a strong suspicion that if this tends to happen in JupyterLab and not other UIs, it is related to JupyterLab's reconnect behavior, the use of websockets (less common in other UIs), or both. My hunch is something like this:
@shaneknapp @felder do you have any indication that CHP starts seeing 503 errors before problems start, i.e. that it might be a cause and not merely a symptom? Short-term 503 errors are 'normal' behavior when a server shuts down prematurely, e.g. due to internal culler behavior, so if that triggers a cascade of too many bound-to-fail requests that don't get cleaned up fast enough, that seems a plausible scenario, at least.
@consideRatio jupyterlab 4.0.11 and 4.2.5, notebook 7.0.7 and 7.2.2:
@minrk It's possible; honestly we're just trying to wrap our heads around the ephemeral port issue and chp, so all possibilities are on the table.
data8
data100
datahub
we actually bumped data100 to the most recent versions of these packages yesterday "just to see" if it helped. however, we allocated 16G of ram per user for an assignment as of this morning, and expect fewer kernel crashes and therefore fewer orphaned (or excess?) ephemeral ports.
Can we test this without JupyterHub? Run JupyterLab, manually start CHP, create a CHP route to JupyterLab, and access JupyterLab via CHP. Based on the above, if you open another tab, or Ctrl-C and restart JupyterLab, the number of ports in use should increase significantly.
well, it usually takes a few hours for the "problems" to start as the ports begin to accumulate. during this time (aka all day) we are seeing plenty of 503s. the rate at which they happen isn't really something i'd say is trackable, since they occur based on how users are interacting with the system (closing a laptop, reopening later, new browser tabs, etc). and, of course, the more users, the more 503s. and as @felder said, we're still trying to unpack what's really happening here and wrap our heads around the 503s and their causation/correlation relationship to chp running out of ephemeral ports.
welp, this just happened on a smaller hub w/about 75 users logged in. the outage started at 3:30pm and lasted 15m. users got a blank page and 'service inaccessible'. you can see the cpu peg at 100% during this time, and eventually it recovered. the proxy ram usage doesn't get anywhere near what we've allocated as max (3G), but there are some interesting ups and downs in that graph. this is a more complex hub deployment, w/two sidecar containers alongside each user container -- one w/mongodb, and the other w/postgres installed.
This is the simplest CHP/JupyterLab setup I can come up with:
Run CHP on default ports (8000 and 8001), no auth, log all requests:
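For example, something like this (assuming configurable-http-proxy is installed globally via npm; debug logging prints each proxied request):

```bash
# Proxy traffic on :8000, REST API on :8001 (the defaults), no auth token set.
configurable-http-proxy --log-level=debug
```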
Create a route:
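For example, via the REST API on port 8001 (unauthenticated here since no token was set):

```bash
# Map /test to the local JupyterLab that will listen on port 8888.
curl -X POST http://localhost:8001/api/routes/test \
  -H 'Content-Type: application/json' \
  -d '{"target": "http://127.0.0.1:8888"}'
```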
Start JupyterLab under the /test/ prefix:
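For example (the token is disabled purely to simplify this manual test):

```bash
# Serve JupyterLab under the /test/ prefix so it matches the CHP route.
# On older jupyter_server versions use --ServerApp.token='' instead.
jupyter lab --no-browser --port=8888 \
  --ServerApp.base_url=/test/ --IdentityProvider.token=''
```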
Open http://localhost:8000/test/ in a browser.
Show IPv4 connections involving ports 8000, 8001, 8888:
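For example, with ss (lsof or netstat would work just as well):

```bash
# IPv4 TCP sockets touching the proxy (8000), the proxy API (8001), or JupyterLab (8888).
ss -tn4 '( sport = :8000 or dport = :8000 or sport = :8001 or dport = :8001 or sport = :8888 or dport = :8888 )'
```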
I've tried reloading JupyterLab in my browser, and killing/restarting JupyterLab. When JupyterLab is killed (Ctrl-C) with the browser still open, CHP doesn't keep any sockets open. There's a burst of connections whilst it loads:
but it stabilises again:
I haven't seen any continual rise in the number of sockets if I repeat the reload/kill/restart cycle.
@manics Just wanted to note that the connections we're concerned about are going to hubip:8081; not sure if that's represented in your test. The issue is not connections from chp to/from user pods, but rather from chp to/from the hub pod. The user pod destinations vary enough that there is no concern about ephemeral ports there (as of yet). It's the hub pod, a single destination/port pair, where we see the issue.
@felder I haven't tested that, since CHP shouldn't be proxying pods to hub:8081; the pods should connect directly to hub:8081. External Hub API requests will go through CHP though, and this includes requests made through the Hub admin UI. Can you turn on debug logging in CHP and filter the logs for connections to hub:8081?
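For example, something like the following sketch; the namespace and the Deployment name "proxy" follow z2jh conventions and are assumptions here, and debug logging needs to be enabled on CHP first (e.g. by passing --log-level=debug):

```bash
# Pull the last hour of proxy logs and keep only lines mentioning the hub API port.
kubectl logs -n jhub deploy/proxy --since=1h | grep '8081'
```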
@manics perhaps this comment describes what we are seeing better? I’ll see what debug logging reveals.
From #388 (comment) onwards there is context on how a CHP pod can end up running out of ephemeral ports, with a mitigation strategy in #388 (comment).