Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in the fetch instrumentation when used against an infinite fetch request resulting in memory leak #4888

Open
shuhaowu opened this issue Jul 30, 2024 · 2 comments · May be fixed by #5281
Assignees
Labels
bug Something isn't working needs:reproducer This bug/feature is in need of a minimal reproducer pkg:instrumentation-fetch priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc

Comments

@shuhaowu
Copy link

shuhaowu commented Jul 30, 2024

What happened?

Steps to Reproduce

  1. Setup auto instrumentation in an application.
  2. Open an infinite fetch request to a server to stream a large amount of data. Read the data via response.getReader().read() and discard the data (i.e. do not store the data in JS memory).
  3. Eventually Chrome will crash as it uses ~10GB of memory holding the response in memory due to the patching done by OpenTelemetry (see additional details below for analysis).

Expected Result

No memory leak occurs.

Actual Result

Memory leaks, browser/OS eventually kills the tab due to memory exhaustion.

Additional Details

Looking at the implementation of patchConstructor, we see these lines where the response is cloned into 2 additional variables:

const resClone = response.clone();
const resClone4Hook = response.clone();

The body of resClone is read. The body data is not used by the code, but this allows the data to be freed from memory:

reader.read().then(
({ done }) => {
if (done) {
endSpanOnSuccess(span, resClone4Hook);
} else {
read();
}
},

However, the body of resClone4Hook is never read from. This causes the browser to keep the response data in memory despite it being consumed by both the original stream by the user and the resClone stream. The resClone4Hook is only used to be passed to endSpanOnSuccess:

The code within opentelemetry doesn't seem to use response.body in any way. It does seem like it passes the response to potentially user-defined functions.

In any case enabling autoinstrumentation causes memory-leak induced browser tab crashes when used with infinite fetch requests. This is very difficult to debug as the memory leak is not even in the JS heap (since resClone4Hook's body never got read to JS, the memory used doesn't show up in JS heap dumps), and instead is happening inside Chrome's private memory. The regular tab JS heap OOM killer doesn't even work with it. Something else in Chrome kills the tab after ~13GB of RAM usage.

OpenTelemetry Setup Code

No response

package.json

No response

Relevant log output

No response

@shuhaowu shuhaowu added bug Something isn't working triage labels Jul 30, 2024
@dyladan dyladan added priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc needs:reproducer This bug/feature is in need of a minimal reproducer and removed triage labels Jul 31, 2024
@dyladan dyladan self-assigned this Jul 31, 2024
@shuhaowu
Copy link
Author

One possible suggestion is the close the resClone4Hook's body. That probably will resolve this leak.

@chancancode
Copy link
Contributor

chancancode commented Dec 11, 2024

I looked into this a bit and made a reproduction for this: https://github.com/tildeio/otel-js-demos/tree/4888-fetch-memory-leak

@shuhaowu did a good job identifying the issue, and once you see the pattern, it's pretty obvious why this would happen. IMO it would be more clear to distill the problematic pattern down in a standalone demo without involving the instrumentation code, so that's what I did.

I did some digging and this was introduced in #2497. IMO, the feature was poorly motivated and is fundamentally incompatible with infinite/long streaming responses.

The context here is that applyCustomAttributesOnSpan is a hook that gets called only after the response is fully streamed. In #2497, the feature request was to preserve the ability to read/introspect the response body at that point, so in order to do that, the solution was to eagerly clone the response, hold on to it, and then pass that cloned response object to the hook when the response has finished streaming. This fundamentally forces the browser to buffer and hold on to the entire response body, and in the case of infinite/long-running streams, that is a fatal strategy.

Now, there is perhaps an argument here that infinite/long-running streams are also fundamentally incompatible with instrumentation anyway, as they hold up the span/trace which can cause other problems. However, because this is done automatically for all fetch requests, it can be hard to notice/trace down as @shuhaowu pointed out; in addition, based on the way the code is written today, the ignore* options doesn't really disable most of the instrumentation code, so it doesn't seem like it would have prevented this issue.

This current design isn't only problematic with infinite/long-running streams. For example, I imagine [citation needed] if the body is consumed via await response.json();, the browser could probably use a streaming JSON parser to parse the body into JS objects as it arrive, without buffering the raw response bytes. The current code would force the browser to always buffer and hold onto the response.

Overall, I tend to think that #2497 was a mistake. "response body can only be read once" is a fundamental design choice of the web platform and should be expected/respected. Eagerly cloning and holding on to the response object is essentially intentionally defeating the optimizations afforded by this design choice and there is a pretty high bar to clear there, and the original feature request just didn't provide a good reason for it. IMO, it probably should just be reverted. Alternatively, with the upcoming v2 timeline, breaking this in v2 may be safer, but at the expense of keeping around the unfixable memory issue in the v1 branch. (nvm, this is an experimental package)

A possible compromise is to add an extra hook that gets called with the original response object before the fetch promise resolves. It feels too easy to misuse however – if whoever uses the hook forgets to clone and consumed the response body, the caller of fetch will get an unusable response object which seems whacky. Since the original feature request didn't elaborate on the actual use case, it's hard to tell what problems it was trying to solve and what solutions would work best.

One possible suggestion is the close the resClone4Hook's body. That probably will resolve this leak.

I added an option to test this in my reproduction/demo app. As far as a I can tell, in Chrome at least, it doesn't do anything. But in any case, I think this is moot – the only reason that extra clone exists was to enable the hook to read the response body, and this suggestion would break that "feature". If we are walking back that "feature" anyway, we could just pass the original response object through to the hook at that point.

chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Dec 17, 2024
…CustomAttributes` hook

Previously, the fetch instrumentation code unconditionally clones
every `fetch()` response in order to preserve the ability for the
`applyCustomAttributes` hook to consume the response body. This is
fundamentally unsound, as it forces the browser to buffer and
retain the response body until it is fully received, which crates
unnecessary memory pressure on long-running response streams.

Fixes open-telemetry#4888
chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Dec 17, 2024
…CustomAttributes` hook

Previously, the fetch instrumentation code unconditionally clones
every `fetch()` response in order to preserve the ability for the
`applyCustomAttributes` hook to consume the response body. This is
fundamentally unsound, as it forces the browser to buffer and
retain the response body until it is fully received and read, which
crates unnecessary memory pressure on large or long-running response
streams. In extreme cases, this is effectively a memory leak and can
cause the browser tab to crash.

Fixes open-telemetry#4888
chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Dec 19, 2024
…CustomAttributes` hook

Previously, the fetch instrumentation code unconditionally clones
every `fetch()` response in order to preserve the ability for the
`applyCustomAttributes` hook to consume the response body. This is
fundamentally unsound, as it forces the browser to buffer and
retain the response body until it is fully received and read, which
crates unnecessary memory pressure on large or long-running response
streams. In extreme cases, this is effectively a memory leak and can
cause the browser tab to crash.

Fixes open-telemetry#4888
chancancode added a commit to tildeio/opentelemetry-js that referenced this issue Dec 21, 2024
…CustomAttributes` hook

Previously, the fetch instrumentation code unconditionally clones
every `fetch()` response in order to preserve the ability for the
`applyCustomAttributes` hook to consume the response body. This is
fundamentally unsound, as it forces the browser to buffer and
retain the response body until it is fully received and read, which
crates unnecessary memory pressure on large or long-running response
streams. In extreme cases, this is effectively a memory leak and can
cause the browser tab to crash.

Fixes open-telemetry#4888
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs:reproducer This bug/feature is in need of a minimal reproducer pkg:instrumentation-fetch priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc
Projects
None yet
4 participants