Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Open
aaronbarsky opened this issue Jun 10, 2024 · 16 comments
Open

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

aaronbarsky opened this issue Jun 10, 2024 · 16 comments
Labels

Comments

@aaronbarsky
Copy link

Summary

Crashlytics is reporting a very rare but consistent crash on the com.apollograph.websocket queue.

Crashed: com.apollographql.websocket
EXC_BAD_ACCESS KERN_INVALID_ADDRESS 0x0000000f9d3993a0
0 libobjc.A.dylib objc_retain + 16
1 libobjc.A.dylib objc_retain_x0 + 16
2 CoreFoundation _inputStreamCallbackFunc + 60
3 CoreFoundation _signalEventSync + 216
4 CoreFoundation ___signalEventQueue_block_invoke + 28
5 libdispatch.dylib _dispatch_call_block_and_release + 32

The app is a mix of 60% foreground and 40% background.

Version

1.9.2

Steps to reproduce the behavior

Unfortunately I only have crash logs

Logs

No response

Anything else?

My hunch is that it's the FoundationStream teardown sequence:

 func cleanup() {
    if let stream = inputStream {
      stream.delegate = nil
      CFReadStreamSetDispatchQueue(stream, nil)
      stream.close()
    }

stream.delegate is an unowned(unsafe) var.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

@aaronbarsky aaronbarsky added bug Generally incorrect behavior needs investigation labels Jun 10, 2024
@calvincestari
Copy link
Member

Hi @aaronbarsky - you mentioned this being consistent; do you know the conditions or sequence of events that leads to the crash. Being able to have a reproduction case would be immensely helpful to resolving this.

@aaronbarsky
Copy link
Author

By consistent, I mean that the call stack is exactly the same in each case. It happens about 1/500 sessions.

The challenge is that the start of the crash is

CoreFoundation _inputStreamCallbackFunc

which makes it quite tough to write a unit test or minimally reproducible example.

@AnthonyMDev
Copy link
Contributor

We're going to really struggle to create or verify a fix for this since, as you said, its difficult to reproduce it.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

What codepaths are you referring to here? If there are some obvious areas where we are creating unsafe memory access, I'd suggest trying to fork the library, fix the cleanup() methods calls and ship a new version. If that prevents the crashes, we'd be happy to take a PR with the fixes you've made?

@aaronbarsky
Copy link
Author

I worked around the problem with an explicit call to closeConnection.

Previously I had

func makeWebSocketClient(connectionHeaders:JSONEncodableDictionary) -> ApolloClientProtocol {
   let configuration = WebSocketTransport.Configuration(
            reconnect: false,
            connectingPayload: connectionHeaders)

    let webSocketClient = WebSocket(
            url: ...,
            protocol: .graphql_transport_ws)

    let wsTransport = WebSocketTransport(
            websocket: webSocketClient,
            config: configuration)

    return ApolloClient(networkTransport: wsTransport, store: ApolloStore(cache: InMemoryNormalizedCache()))
}
    // use client
    apolloClient = nil

Now I keep an explicit reference to the networkTransport so I can close it before deallocating the client

   let configuration = ...
   let webSocketClient = ...
   self.wsTransport = ...
   return  apolloClient = ...
   // use client
   self.wsTransport.closeConnection()
   self.wsTransport = nil
   self.apolloClient = nil

(Note that networkTransport is internal to ApolloClient so I need to keep the extra reference for access to the closeConnection function)

Fixing it in the library would require a lot of changes, that could be complex.

The problem is that FoundationStream marks the delgate as @unowned(unsafe) instead of @weak. This means that when the delegate is deallocated, rather than skipping the callback, it crashes. To avoid this, we need to make sure that the delegate is only released after close has been called on the callback queue.

Current situation

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                delegate deallocated
                                 delegate.stream(handle:)
                                 Crash()

My theoretical fix would look something like

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                capture strong reference to delegate in work block
                                                                schedule block on StreamCallbackQueue
                                 delegate.stream(handle:)
                                 complete work block releasing last reference
                                 delegate deallocated

The tricky bit is that you can't make strong references to self in the deinit. So rather than the delegate being an instance of Apollo.FoundationStream, it would have to be a separate object with a weak reference back to FoundationStream.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

@calvincestari
Copy link
Member

calvincestari commented Sep 11, 2024

Thanks for the great debugging @aaronbarsky.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

The current StarScream-based websocket implementation will be removed from Apollo iOS; it's old, unsupported and hindering progress. Ideally we build a default websocket implementation using iOS native websocket APIs but allow users to swap in whatever websocket library they choose to use. When that happens is still undecided though. We're trying to keep the scope of 2.0 breaking changes as small as possible and it really depends how much work is needed here to achieve that.

@pixelmatrix
Copy link

I'm seeing a handful of crashes like this as well in my app

@pixelmatrix
Copy link

pixelmatrix commented Sep 12, 2024

@aaronbarsky Isn't setting the delegate to nil supposed to set the delegate back to the stream instance itself? I'm not seeing this happening in the debugger, but that's what the documentation says:

By a default, a stream object must be its own delegate; so a delegate message with an argument of nil should restore this delegate. Don’t retain the delegate to prevent retain cycles.

If that's the case, could the crash be that inputStream or outputStream are getting set to nil, thus the delegate is nil when the system attempts to call it?


On my end, the crashes seem to be happening because I call pauseWebSocketConnection when the app goes into the background, which eventually triggers cleanup, but leaves the Apollo.FoundationStream stream instance in place. The subscriptions can be pretty active, so I'm thinking the crash is happening when a pending message is queued up while we pause the connection.

@arnauddorgans
Copy link
Contributor

Any update here? I see a lot of crashes with 100% iOS 18 on com.apollographql.websocket

@calvincestari
Copy link
Member

@arnauddorgans - if you're able to provide a reproduction sample project or more debugging on your end may help in figuring out the root cause. I think this issue is becoming a catch-all for anything websocket related and they may not all be the same thing.

@arnauddorgans
Copy link
Contributor

I am not able to reproduce it, but it is our top crash

Screenshot 2024-10-08 at 11 02 50
          Crashed: com.apollographql.websocket
0  CoreFoundation                 0xfe54 <redacted> + 76
1  CoreFoundation                 0xcd3c CFRelease + 60
2  CoreFoundation                 0xc05e0 <redacted> + 252
3  CoreFoundation                 0x112050 <redacted> + 28
4  libdispatch.dylib              0x2370 _dispatch_call_block_and_release + 32
5  libdispatch.dylib              0x40d0 _dispatch_client_callout + 20
6  libdispatch.dylib              0xb6d8 _dispatch_lane_serial_drain + 744
7  libdispatch.dylib              0xc1e0 _dispatch_lane_invoke + 380
8  libdispatch.dylib              0x17258 _dispatch_root_queue_drain_deferred_wlh + 288
9  libdispatch.dylib              0x16aa4 _dispatch_workloop_worker_thread + 540
10 libsystem_pthread.dylib        0x4c7c _pthread_wqthread + 288
11 libsystem_pthread.dylib        0x1488 start_wqthread + 8

@aaronbarsky
Copy link
Author

I finally caught this crash in action. Not sure if this is helpful or not:
Screenshot 2024-10-30 at 2 10 05 PM

@arnauddorgans
Copy link
Contributor

Same here
Screenshot 2024-11-07 at 23 52 42

@vladdorfman
Copy link

Is it possible that next minor version will contains the fix for this? It seems reproducible.

@calvincestari
Copy link
Member

Is it possible that next minor version will contains the fix for this? It seems reproducible.

@vladdorfman, all the recent instances of crash captures seem to be adhoc. I don't think it's reliably reproducible yet but if you're able to write a test that can consistently crash please do.

@AnthonyMDev
Copy link
Contributor

Yeah, we will need a reproduction case we can use to debug and verify a fix. Even if it only crashes one out of ten times. But a screenshot of the stack trace in your existing project is not enough for us to track this down.

If you are able to create a unit test or provide an example project with the crash occurring, we will prioritize a fix.

@calvincestari
Copy link
Member

I spent some time on this issue yesterday and again today trying to craft a test that fails. I haven't managed to replicate a single failure in 10000 repetitions of various tests; closing the websocket, leaving it dangling, setting the transport to nil, cancelling the subscription, etc.

The stack trace screenshots unfortunately don't tell us much other than it's happening in cleanup, and I don't think we can confidently make any changes to the cleanup code without knowing exactly how things are failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants