Socket leak #434
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗 |
This sounds similar to #388. Would you mind adding your comment on that issue and closing this one to avoid duplicates? Thanks! |
@manics that looks like a memory leak, while this is a socket leak. They should be different, no? |
We just saw a node with a lot of proxy pods spew a lot of |
ws release is unlikely to be relevant, since it's only used in tests. |
I think the relevant code is configurable-http-proxy/lib/configproxy.js, lines 233 to 245 (at cb03f77). |
I think this remains a problem in 4.6.1 using node 20 on GKE with linux kernel 5.15, based on info from @shaneknapp. |
CHP has multiple node HTTP servers working in parallel: one for its own REST API, one for proxying, one for metrics. It would be good to determine whether the growing tcp memory / sockets etc. are associated with a specific one of these. |
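For anyone digging into this, one way to attribute the growth could be to poll each server's connection count with Node's `server.getConnections()`. A minimal sketch only, not CHP's actual code; the server names and ports here are made up:

```js
const http = require("http");

// stand-ins for the three servers; names and ports are placeholders for the sketch
const servers = {
  api: http.createServer((req, res) => res.end("api")).listen(8001),
  proxy: http.createServer((req, res) => res.end("proxy")).listen(8000),
  metrics: http.createServer((req, res) => res.end("metrics")).listen(8002),
};

// log each server's open connection count once a minute
setInterval(() => {
  for (const [name, server] of Object.entries(servers)) {
    server.getConnections((err, count) => {
      if (!err) console.log(`${name}: ${count} open connections`);
    });
  }
}, 60 * 1000);
```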
Looked at one CHP process and saw for example...
This wasn't expected to have anything close to 500 open connections or similar, so I think it's very safe to say that this reproduces. This is from the latest chp running with node 20 on linux kernel 5.15 nodes. |
I'm not sure when I expect a socket to be closed. When it times out based on a "timeout" option? I think the timeout option may be infinite. Is the issue that there is simply nothing that makes us destroy sockets once created, because we default to an infinite timeout? |
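For reference, a minimal sketch of what the Node-level defaults look like and how an idle-socket timeout could be applied. This is not what CHP currently does, and the 24h value is only the number floated in the next comment:

```js
const http = require("http");
const server = http.createServer((req, res) => res.end("ok"));

// server.timeout is the idle socket timeout in milliseconds;
// 0 means sockets are never timed out by the server (the default in recent Node versions)
console.log(server.timeout, server.keepAliveTimeout);

// one possible mitigation: time out idle sockets after e.g. 24 hours,
// and make sure a timed-out socket is actually destroyed
server.setTimeout(24 * 60 * 60 * 1000, (socket) => socket.destroy());

server.listen(8000);
```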
@minrk and others, is it safe to change the default for a timeout value here to something quite extreme, like 24 hours? I don't want us to disrupt users that are semi-active and run into issues at the 24th hour - but they wouldn't as long as they are semi-active, right? We have two timeout args matching the node-http-proxy options at https://github.com/http-party/node-http-proxy?tab=readme-ov-file#options. |
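For clarity, those two options map onto node-http-proxy roughly like this (a sketch only; the target is a placeholder and the 24h values just echo the number discussed above):

```js
const http = require("http");
const httpProxy = require("http-proxy");

// both values are in milliseconds
const proxy = httpProxy.createProxyServer({
  target: "http://127.0.0.1:8888",   // placeholder target
  timeout: 24 * 60 * 60 * 1000,      // socket timeout for the incoming request
  proxyTimeout: 24 * 60 * 60 * 1000, // timeout for the outgoing request to the target
});

http.createServer((req, res) => proxy.web(req, res)).listen(8000);
```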
There is also a related issue reported in node-http-proxy - http-party/node-http-proxy#1510. node-http-proxy was forked and had that issue fixed with a one-line commit at Jimbly/http-proxy-node16@56283e3.
Looking closer at that fork, there is also another memory leak fixed in another commit according to the commit message: Jimbly/http-proxy-node16@ba0c414. This is detailed in a PR as well: http-party/node-http-proxy#1559. Those two memory-fix commits are the only actual fixes in the fork; the rest is just docs etc.
Maybe we should do a build of chp based on the forked node-http-proxy project and push a tag that users can opt into? Like 4.6.1-fork? |
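If that one-line fix is along the lines of destroying the socket rather than only ending it on error (an assumption on my part, since the exact change isn't quoted here), it would look roughly like this. Illustrative only, not the actual diff from the linked commits:

```js
const httpProxy = require("http-proxy");

const proxy = httpProxy.createProxyServer({ target: "ws://127.0.0.1:8888", ws: true });

// for upgraded (websocket) requests, node-http-proxy's error event hands back
// the raw socket; destroying it releases the file descriptor, whereas only
// ending it can leave the connection half-open if the peer never responds
proxy.on("error", (err, req, socketOrRes) => {
  if (socketOrRes && typeof socketOrRes.destroy === "function" && !socketOrRes.destroyed) {
    socketOrRes.destroy();
  }
});
```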
I pushed a 4.6.1-fork image tag. If someone wants to try whether this fork helps, just reference 4.6.1-fork as the image tag instead. |
Nice research! I don't think we should have a separate -fork image. If we think the fork is production-ready then we should either switch CHP to use it if it's fully trusted, or vendor it if it's not (as previously discussed in #413 (comment)). |
i'll be testing out the fork. edit: here are my changes in case anyone wants to see them! |
If it works, 👍 to just switching the dependency and publishing a new release, without any -fork tag. I think the sustainable longer-term solution is to vendor http2-proxy, which I started here but haven't had time to finish. It would be great to be able to have some actual tests to exercise these things, since it's been hard to control or verify. |
ok, i identified the most problematic hub... the chp pod has been getting OOMKilled and stack tracing at least every 2-3 days. i just deployed the test fork of the chp pod to it and will keep an eye on things over the rest of the week. in other news, this fix seems (read: seems) to use less tcp memory than before. it's tough to say for certain, but at the very least w/my latest deployment on the problematic hub i'll have something that's mildly on fire to watch, vs the others i've deployed on much less trafficked hubs. 🤞 |
womp womp. that pod has restarted three times in the past hour after deploying the -fork chp:
i'm also seeing a lot of these
and sometimes after it's killed (but not every time) we get the following in dmesg:
are we DOSing ourselves somehow? |
i killed that pod and the so. confusing. |
nope. that pod's chp is still getting OOMKilled. time for a break. :) |
What is its normal memory use, and what is its k8s memory request/limit? |
i've reverted that node is still regularly hitting the max memory, but just not as quickly(?) as with
i looked at chp pod restarts across our hubs. we're seeing this intermittently across our biggest deployments (both 4.6.1 and 4.6.1-fork... which all run on the same core node). the smaller-usage hubs' chps usually run between ~200-300MiB (~200 users max at any time, no massive spikes in logins etc). the larger hubs run ~400+ MiB but, depending on user count, the chp pods eventually run out of memory and are OOMKilled. today was spent looking into our largest class, but the other two had definitely been experiencing the same issue with slightly lesser impact. now that i know what to look for, i'll keep an eye on these individual pods' memory usage and know better what to look for in the logs. the biggest takeaway so far is that |
i also suspect the disclaimer: i have a few tabs open and suspect that the former might be that something isn't cleaning up when something in the proxy stack goes away. the latter might just be a symptom of this, but i can neither confirm nor deny. https://en.wikipedia.org/wiki/Martian_packet so martians are involved somehow, causing network flooding and a DOS? |
@consideRatio i've rolled back the 11 hubs that had after rolling back to le sigh. :( |
@shaneknapp is this fork vs non-fork only, or fork+timeout flags vs non-fork? |
fork + timeouts. forgot about the timeouts, actually! maybe it's that!
fwiw it's been a hell of a week lol.
|
btw i rolled the fork back out to a couple of smaller hubs, minus the timeout settings. everything seems cromulent, but the only way to really test this is to have a lot (200+) of people logging in within a short period of time. |
quick update here: this fix really does look promising. orphaned sockets seem to drop significantly, and memory usage doesn't explode wildly and cause users to receive 500s. |
huzzah! Thank you so much for testing and reporting, @shaneknapp. @consideRatio do you want to switch the dependency to the fork and make a real release of CHP? |
@minrk -- while i'm confident it helps, i'm also confident that it doesn't fix the problem outright. while we're not getting the spiky and constant OOMKills w/this test fork, there is still a pretty significant memory leak somewhere: i checked our javascript heap as well (this is in Mi):
as you can see we're bouncing against that pretty quickly (we had a 3-day weekend this week so the figures are a little smaller than usual): so there is still a significant memory leak. maybe we've exposed another bug during the testing... we're also seeing many 503s for ECONNREFUSED on our two biggest hubs. these pop up after the chp has been at the heap limit for a couple of hours, and it looks like people's proxy from the core node (w/the hub and chp pods) is disappearing. this is SUPER disruptive and is impacting coursework.
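For anyone wanting to pull the same numbers, a minimal sketch of reading the V8 heap statistics from a Node process; how you run it against the CHP pod is up to you, and CHP doesn't expose this itself as far as shown here:

```js
const v8 = require("v8");

const stats = v8.getHeapStatistics();
const toMiB = (bytes) => Math.round(bytes / 1024 / 1024);

console.log(`used heap:  ${toMiB(stats.used_heap_size)} MiB`);
console.log(`total heap: ${toMiB(stats.total_heap_size)} MiB`);
console.log(`heap limit: ${toMiB(stats.heap_size_limit)} MiB`); // the ceiling hit before OOM
```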
another quick update: the 503 errors we're getting are appearing on high traffic hubs running both the vanilla and fork versions of the chp. they're appearing in multiples of 30 (30 or 60). i think this behavior might be related to, but not caused by the chp. |
i just deployed the fork to prod for all of our hubs -- the fork seems to be holding up quite well on the high-traffic hubs (>1k users/day, high compute loads), so now we're rolling it out for the rest. if this continues to squelch the memory spikes/OOMKills for another week i'd feel comfortable giving my thumbs-up to roll the fork in to a new release! |
alright, it's been a week and i feel very comfortable in saying that we should definitely roll these changes into a release branch asap. it doesn't fix the problem outright (our highest-traffic/load hubs still have one or two chp OOMKills per day w/250+ concurrent users), but it's a significant improvement over vanilla 4.6.1! i firmly believe that we should still investigate further, and even after deploying 4.6.1-fork, we had another OOMKill/chp outage on march 5th that impacted ~300 users. yesterday, i sat down w/GCP and core node kernel logs, plus grafana/prom data, and put together a minute-by-minute timeline of how things went down. since i'll be on a plane for a few hours today, i'm hoping to get this transcribed from paper to an update on this github issue to help w/debugging. TL;DR: |
quick ping here... is the |
@shaneknapp thanks for the ping. I opened #539, then we can make a release with it. |
@shaneknapp Since it improves but doesn't fully fix the issue, it might be worth also testing an older CHP image (or building your own) based on NodeJS <=15.4
4.6.2 was released 2 months ago with a fix for the leaking sockets. Can we close this issue? |
Bug description
We are running z2jh (https://z2jh.jupyter.org/en/stable/) and found a socket leak in the proxy pod. The number of sockets is constantly increasing (over 60k), and after about a week the kernel reports an error: kernel: TCP: out of memory -- consider tuning tcp_mem. I have checked the number of open sockets using lsof.
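For reference, a rough equivalent of that lsof check as a small Node script (Linux only, and just a sketch; lsof itself is the simpler tool):

```js
const fs = require("fs");

// count fds in /proc/<pid>/fd that point at sockets (links look like "socket:[12345]")
function countSockets(pid) {
  const fdDir = `/proc/${pid}/fd`;
  return fs.readdirSync(fdDir).filter((fd) => {
    try {
      return fs.readlinkSync(`${fdDir}/${fd}`).startsWith("socket:");
    } catch {
      return false; // fd disappeared while we were looking
    }
  }).length;
}

console.log(countSockets(process.argv[2] || process.pid));
```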
Your personal set up
This chart uses the proxy docker image from jupyterhub/configurable-http-proxy:4.5.0
The config.yaml related to the proxy: