Memory leak in the fetch instrumentation when used against an infinite fetch request resulting in memory leak #4888

shuhaowu · 2024-07-30T19:53:10Z

What happened?

Steps to Reproduce

Setup auto instrumentation in an application.
Open an infinite fetch request to a server to stream a large amount of data. Read the data via response.getReader().read() and discard the data (i.e. do not store the data in JS memory).
Eventually Chrome will crash as it uses ~10GB of memory holding the response in memory due to the patching done by OpenTelemetry (see additional details below for analysis).

Expected Result

No memory leak occurs.

Actual Result

Memory leaks, browser/OS eventually kills the tab due to memory exhaustion.

Additional Details

Looking at the implementation of patchConstructor, we see these lines where the response is cloned into 2 additional variables:

opentelemetry-js/experimental/packages/opentelemetry-instrumentation-fetch/src/fetch.ts

Lines 351 to 352 in 2e42181

    
           const resClone = response.clone(); 
        
           const resClone4Hook = response.clone();

The body of resClone is read. The body data is not used by the code, but this allows the data to be freed from memory:

opentelemetry-js/experimental/packages/opentelemetry-instrumentation-fetch/src/fetch.ts

Lines 357 to 364 in 2e42181

    
           reader.read().then( 
        
             ({ done }) => { 
        
               if (done) { 
        
                 endSpanOnSuccess(span, resClone4Hook); 
        
               } else { 
        
                 read(); 
        
               } 
        
             },

However, the body of resClone4Hook is never read from. This causes the browser to keep the response data in memory despite it being consumed by both the original stream by the user and the resClone stream. The resClone4Hook is only used to be passed to endSpanOnSuccess:

opentelemetry-js/experimental/packages/opentelemetry-instrumentation-fetch/src/fetch.ts

Line 360 in 2e42181

endSpanOnSuccess(span, resClone4Hook);

The code within opentelemetry doesn't seem to use response.body in any way. It does seem like it passes the response to potentially user-defined functions.

In any case enabling autoinstrumentation causes memory-leak induced browser tab crashes when used with infinite fetch requests. This is very difficult to debug as the memory leak is not even in the JS heap (since resClone4Hook's body never got read to JS, the memory used doesn't show up in JS heap dumps), and instead is happening inside Chrome's private memory. The regular tab JS heap OOM killer doesn't even work with it. Something else in Chrome kills the tab after ~13GB of RAM usage.

OpenTelemetry Setup Code

No response

package.json

No response

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

shuhaowu · 2024-07-31T20:11:59Z

One possible suggestion is the close the resClone4Hook's body. That probably will resolve this leak.

chancancode · 2024-12-11T09:53:54Z

I looked into this a bit and made a reproduction for this: https://github.com/tildeio/otel-js-demos/tree/4888-fetch-memory-leak

@shuhaowu did a good job identifying the issue, and once you see the pattern, it's pretty obvious why this would happen. IMO it would be more clear to distill the problematic pattern down in a standalone demo without involving the instrumentation code, so that's what I did.

I did some digging and this was introduced in #2497. IMO, the feature was poorly motivated and is fundamentally incompatible with infinite/long streaming responses.

The context here is that applyCustomAttributesOnSpan is a hook that gets called only after the response is fully streamed. In #2497, the feature request was to preserve the ability to read/introspect the response body at that point, so in order to do that, the solution was to eagerly clone the response, hold on to it, and then pass that cloned response object to the hook when the response has finished streaming. This fundamentally forces the browser to buffer and hold on to the entire response body, and in the case of infinite/long-running streams, that is a fatal strategy.

Now, there is perhaps an argument here that infinite/long-running streams are also fundamentally incompatible with instrumentation anyway, as they hold up the span/trace which can cause other problems. However, because this is done automatically for all fetch requests, it can be hard to notice/trace down as @shuhaowu pointed out; in addition, based on the way the code is written today, the ignore* options doesn't really disable most of the instrumentation code, so it doesn't seem like it would have prevented this issue.

This current design isn't only problematic with infinite/long-running streams. For example, I imagine [citation needed] if the body is consumed via await response.json();, the browser could probably use a streaming JSON parser to parse the body into JS objects as it arrive, without buffering the raw response bytes. The current code would force the browser to always buffer and hold onto the response.

Overall, I tend to think that #2497 was a mistake. "response body can only be read once" is a fundamental design choice of the web platform and should be expected/respected. Eagerly cloning and holding on to the response object is essentially intentionally defeating the optimizations afforded by this design choice and there is a pretty high bar to clear there, and the original feature request just didn't provide a good reason for it. IMO, it probably should just be reverted. ~~Alternatively, with the upcoming v2 timeline, breaking this in v2 may be safer, but at the expense of keeping around the unfixable memory issue in the v1 branch.~~ (nvm, this is an experimental package)

A possible compromise is to add an extra hook that gets called with the original response object before the fetch promise resolves. It feels too easy to misuse however – if whoever uses the hook forgets to clone and consumed the response body, the caller of fetch will get an unusable response object which seems whacky. Since the original feature request didn't elaborate on the actual use case, it's hard to tell what problems it was trying to solve and what solutions would work best.

One possible suggestion is the close the resClone4Hook's body. That probably will resolve this leak.

I added an option to test this in my reproduction/demo app. As far as a I can tell, in Chrome at least, it doesn't do anything. But in any case, I think this is moot – the only reason that extra clone exists was to enable the hook to read the response body, and this suggestion would break that "feature". If we are walking back that "feature" anyway, we could just pass the original response object through to the hook at that point.

…CustomAttributes` hook Previously, the fetch instrumentation code unconditionally clones every `fetch()` response in order to preserve the ability for the `applyCustomAttributes` hook to consume the response body. This is fundamentally unsound, as it forces the browser to buffer and retain the response body until it is fully received, which crates unnecessary memory pressure on long-running response streams. Fixes open-telemetry#4888

…CustomAttributes` hook Previously, the fetch instrumentation code unconditionally clones every `fetch()` response in order to preserve the ability for the `applyCustomAttributes` hook to consume the response body. This is fundamentally unsound, as it forces the browser to buffer and retain the response body until it is fully received and read, which crates unnecessary memory pressure on large or long-running response streams. In extreme cases, this is effectively a memory leak and can cause the browser tab to crash. Fixes open-telemetry#4888

shuhaowu added bug Something isn't working triage labels Jul 30, 2024

dyladan added priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc needs:reproducer This bug/feature is in need of a minimal reproducer and removed triage labels Jul 31, 2024

dyladan self-assigned this Jul 31, 2024

pichlermarc added the pkg:instrumentation-fetch label Jul 31, 2024

chancancode mentioned this issue Dec 14, 2024

[instrumentation-fetch] refactor fetch() tests for clarity, type safety and realism #5268

Merged

10 tasks

chancancode linked a pull request Dec 18, 2024 that will close this issue

feat(instrumentation-fetch)! Passthrough original request to applyCustomAttributes #5281

Open

8 tasks

chancancode mentioned this issue Dec 21, 2024

[instrumentation-fetch] Support adding attributes to http spans based on the response body #5293

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak in the fetch instrumentation when used against an infinite fetch request resulting in memory leak #4888

Memory leak in the fetch instrumentation when used against an infinite fetch request resulting in memory leak #4888

shuhaowu commented Jul 30, 2024 •

edited

Loading

shuhaowu commented Jul 31, 2024

chancancode commented Dec 11, 2024 •

edited

Loading

Memory leak in the fetch instrumentation when used against an infinite fetch request resulting in memory leak #4888

Memory leak in the fetch instrumentation when used against an infinite fetch request resulting in memory leak #4888

Comments

shuhaowu commented Jul 30, 2024 • edited Loading

What happened?

Steps to Reproduce

Expected Result

Actual Result

Additional Details

OpenTelemetry Setup Code

package.json

Relevant log output

shuhaowu commented Jul 31, 2024

chancancode commented Dec 11, 2024 • edited Loading

shuhaowu commented Jul 30, 2024 •

edited

Loading

chancancode commented Dec 11, 2024 •

edited

Loading