Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on upgrade from 8.20.0 -> 8.25.0 via Span.finish #3977

Open
nebsta opened this issue May 14, 2024 · 35 comments
Open

Crash on upgrade from 8.20.0 -> 8.25.0 via Span.finish #3977

nebsta opened this issue May 14, 2024 · 35 comments

Comments

@nebsta
Copy link

nebsta commented May 14, 2024

Platform

iOS

Environment

Develop

Installed

Swift Package Manager

Version

8.25.0

Did it work on previous versions?

8.20.0

Steps to Reproduce

  1. Install a version of the app that is using version 8.20.0 of the Sentry iOS SDK
  2. Update to a version of the app that is using version 8.25.0 of the Sentry iOS SDK
  3. Crash occurs on startup

Expected Result

We expect the crash not to occur

Actual Result

330344319-4c996226-fbde-4d92-b44f-eaeee4a9839f

Screenshot 2024-05-14 at 10 52 38

Screenshot 2024-05-14 at 10 52 48

Few other notes:

  • The span being finished is one that captures the boot time for all our internal systems at startup
  • After the crash, if you restart the app on the 8.25.0 version there won't be any crash after that

Are you willing to submit a PR?

No response

@philipphofmann
Copy link
Member

Hey @nebsta,

I tried to reproduce the issue, but couldn't. I see that you have your own wrapper around SenrySpan. Can you share more details on how to initialize your instance of SentrySpan so we can reproduce the issue? A small sample project that reproduces the issue would even be the best. Furthermore, it would be great if you could share the crash report with us.

It seems like the culprit is #3892, which moved capturing the transaction to a background thread.

@nebsta
Copy link
Author

nebsta commented May 28, 2024

Hey @nebsta,

I tried to reproduce the issue, but couldn't. I see that you have your own wrapper around SenrySpan. Can you share more details on how to initialize your instance of SentrySpan so we can reproduce the issue? A small sample project that reproduces the issue would even be the best. Furthermore, it would be great if you could share the crash report with us.

It seems like the culprit is #3892, which moved capturing the transaction to a background thread.

@philipphofmann sorry for the late reply. Didn't get a notification about you responding. Here is a snippet of what the wrapper looks like. We mainly have this so that we can swap out reporting libraries in the future if we want to. As you can see it's pretty simple, nothing too fancy.

// Copyright © 2023 Trustpilot. All rights reserved.

import Foundation
import Sentry

class SentrySpan: SentrySpanProtocol {
  private let sentrySpan: Span
  
  init(span: Span) {
    self.sentrySpan = span
  }
  
  func startChild(operation: String) -> SentrySpanProtocol {
    let sentrySpan = self.sentrySpan.startChild(operation: operation)
    return SentrySpan(span: sentrySpan)
  }
  
  func finish(status: TrustpilotSpanStatus) {
    sentrySpan.finish(status: status.toSentrySpanStatus())
  }
  
  func finish() {
    sentrySpan.finish()
  }
}

extension TrustpilotSpanStatus {
  func toSentrySpanStatus() -> SentrySpanStatus {
    switch self {
    case .internalError: return .internalError
    }
  }
}

// Copyright © 2023 Trustpilot. All rights reserved.

import Foundation

protocol SentrySpanProtocol {
  func startChild(operation: String) -> SentrySpanProtocol
  func finish(status: TrustpilotSpanStatus)
  func finish()
}

enum TrustpilotSpanStatus {
  case internalError
}

I'll see if I can get a full crash report from the sentry dashboard and post it here soon.

@philipphofmann
Copy link
Member

Thanks for the update, @nebsta. The full crash report would be helpful. I think I see a couple of occurrences of this in our internal SDK crash detection.

@nebsta
Copy link
Author

nebsta commented May 29, 2024

@philipphofmann classically I can't seem to replicate the issue anymore, so I'm unable to get the crash log for you. But by the sounds of it, it seems like it's a known issue if you have seen occurrences of it.

@philipphofmann
Copy link
Member

Yes, I think we should have enough information to fix it. Thanks for the update.

@nebsta
Copy link
Author

nebsta commented Jun 12, 2024

@philipphofmann any update on a fix?

@philipphofmann
Copy link
Member

@nebsta, I'm sorry no, not yet. We didn't get to this yet.

@nebsta
Copy link
Author

nebsta commented Jul 2, 2024

@philipphofmann are there any updates on this issue? We are currently blocked with upgrading to the latest sentry, as if we do, it will cause most (if not all) of our users to experience a crash on start up.

@krystofwoldrich
Copy link
Member

Hi @nebsta,
still no updates. But we will keep this issue up to date.

@P-Foley-Trustpilot
Copy link

Hey @krystofwoldrich @philipphofmann - Just to update on this, this issue is currently blocking us from upgrading to Xcode 16 as 8.20.0 won't compile with Xcode 16 and upgrading to 8.36.0 (the latest version) still causes this crash.
Were you able to reproduce this issue?

@brustolin
Copy link
Contributor

Hello @P-Foley-Trustpilot can you share the error Xcode 16 gives with version 8.20.0?

@P-Foley-Trustpilot
Copy link

Hi @brustolin, i've attached the errors list and where they occurImage
Image
Image

@brustolin
Copy link
Contributor

Oh yes. We have fixes for those missing references.
And Im sorry, but we dont have any feedback on the Span issue, we couldn't reproduce it.

@philipphofmann
Copy link
Member

@nebsta, thanks for the update. Can you share a stacktrace of a crash on a simulator?

@nebsta
Copy link
Author

nebsta commented Sep 20, 2024

@nebsta, thanks for the update. Can you share a stacktrace of a crash on a simulator?

I can't seem to export the detailed stack trace. The best I can do is the screenshot that I originally attached to the ticket

@nebsta
Copy link
Author

nebsta commented Sep 25, 2024

@philipphofmann I had a thought the other day. Are the span operations thread safe? Because we're currently calling span.finish within an actor, so it's being executed in the threadpool and not on the main actor. Could that potentially be why we're seeing this problem?

@brustolin
Copy link
Contributor

Hello @nebsta, span operations are thread safe, finishing it outside the main actor should not cause a problem.

Do you have Sentry profiling enabled? If so, can you test without it?

@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

Hello @nebsta, span operations are thread safe, finishing it outside the main actor should not cause a problem.

Do you have Sentry profiling enabled? If so, can you test without it?

Yeah we do, I can turn it off and see if that resolves the issue

@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

@brustolin I tried setting the profiling sample rate to 0, and it doesn't seem to be happening anymore, though I also noticed before changing the same rate, that the exception was happening on my device via XCode even on a fresh install. Which is different to how we originally reported it, where it would only happen when we updated from an older version of sentry.

@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

@brustolin @philipphofmann I did notice one thing that may be related to this issue. Would like to get your thoughts on it.

Basically we use the manually profiling transactions/spans for timing how long our bootstrapping process takes in the app. Our Bootstrapping system is broken up into steps, with each step being a single span within the overall bootstrap transaction.

I just noticed that there is actually a bug whereby if one of the bootstrapping items fails, we actually never finish the running span, or the running transaction. This would also makes sense, as I have also seen our profiling sometimes showing ridiculously large bootstrapping times, which this is probably the reason for.

The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?

@philipphofmann
Copy link
Member

The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?

Yes, that could cause problems, but he SDK shouldn't crash. If you can confirm that this fixes the problem, it's easier for us to investigate the root cause.

@brustolin
Copy link
Contributor

The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?

Finishing or refinishing a span should not matter; we have safeguards for these scenarios.

If you could provide us with a reproduction of your flow and how you use transactions/spans, it would help us a lot in identifying whether we have a bug in the said safeguards.

@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

@brustolin I've outlined an explanation of how we're using them below, including some code examples as well. Hopefully that gives you a better picture.

The main class where we use the spans and where this exception is happening is called the Bootstrapper. This is an actor, which goes through a list of BootstrapItem instances (which are each themselves actor instances as well), and executes some setup work that we need to perform on the startup of our application.

You can see a full gist of what the Boostrapper looks like here, and what one of the BootstrapItem implementations look like here

If you look within the Bootstrapper.loadAll method, you'll see where we are creating/finishing all the spans for the bootstrap process. We start 1 span for the entire bootstrap process, and then spin off child spans for each item so that we can get a better breakdown of where our bottlenecks are.

    let transaction = dependencies.sentry.startTransaction(name: "Bootstrap", operation: "Start") // HERE
    for item in items {
      let itemName = String(describing: type(of: item))
      TLog.bootstrap.info("Loading item: \(itemName)")
      
      dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.started(name: itemName))
      
      var span: SentrySpanProtocol?
      do {
        span = transaction?.startChild(operation: itemName) // HERE
        try await item.load()
        try Task.checkCancellation()
        
        dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.finished(name: itemName))
        span?.finish()
      } catch {
        TLog.bootstrap.error("Error encountered on item: \(itemName), error: \(error)")
        span?.finish(status: .internalError) // HERE
        transaction?.finish() // HERE
        dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.failed(name: itemName))
        
        /// Some bootstrap items may not need to raise errors up to Sentry, so we log them here instead.
        guard item.shouldRaiseFailureToSentry else {
          TLog.bootstrap.error("Error encountered on item: \(itemName), error: \(error)")
          return
        }
        throw BootstrapError.itemFailed(itemName: String(itemName), error: error)
      }
    }
    transaction?.finish() // HERE

As I mentioned, there is a bug in this implementation, as if there is an exception in any of the bootstrap items, then we don't properly finish the span for the entire bootstrap process, which is something we're going to fix on our end. This is why I was wondering if perhaps this could be the reason for our crash, as I mentioned that we do sometimes see some instances in the 99th percentile that are reported to be going for as long as a couple of days, so I wonder if there is an issue in the mechanism for ensure that spans are always cleaned up.

Image

Hopefully that's enough to give you a better picture.

@nebsta
Copy link
Author

nebsta commented Sep 27, 2024

I should also mention that the spans we're using the Bootstrapper are actually our own span class (which you can see here, which is just a light wrapper around the Sentry Span class. We do this so that our reporting manager is SDK agnostic, and so we can more easily write unit testing.

@philipphofmann
Copy link
Member

I looked at the diff in SentryTracer from 8.20.0 to 8.25.0 and couldn't find anything suspicious that could cause this crash.

@nebsta, can you try to enable the Sentry Cocoa SDK debug logs with options.debug = true and share them with us here please? That could help us identify the problem.

Also, can you try to update to the latest version 8.36.0, and check if the problem persists? I know you currently can't update to a higher version than 8.25.0, but if you could check that in debug and it works, we know the problem was fixed.

There is a small chance that #4380 fixes the problem, but I doubt it because canceling the deadline timer for the manual transaction should be a NoOp as the deadline timer is nil.

@nebsta
Copy link
Author

nebsta commented Oct 1, 2024

I looked at the diff in SentryTracer from 8.20.0 to 8.25.0 and couldn't find anything suspicious that could cause this crash.

@nebsta, can you try to enable the Sentry Cocoa SDK debug logs with options.debug = true and share them with us here please? That could help us identify the problem.

Also, can you try to update to the latest version 8.36.0, and check if the problem persists? I know you currently can't update to a higher version than 8.25.0, but if you could check that in debug and it works, we know the problem was fixed.

There is a small chance that #4380 fixes the problem, but I doubt it because canceling the deadline timer for the manual transaction should be a NoOp as the deadline timer is nil.

@philipphofmann We have been attempting to upgrade to 8.36.0 instead since it came out, but we're still getting the same issue. I'll try enabling the logs and see how that goes

@nebsta
Copy link
Author

nebsta commented Oct 1, 2024

@philipphofmann I've attached the logs that appear when enabling debug leading up to the crash. One thing I also noticed actually looking at the code, which you'll also see in the logs, is that we actually start sentry on boot, then we stop it after the bootstrapper and then call start agains straight after. I can't remember exactly why we do this, but I wonder if this could also be causing issues.

sentrylogs.log

@philipphofmann
Copy link
Member

The last log message is

[Sentry] [debug] [SentrySDK:224] SDK main thread init started...

The logs don't show more log messages for installing all the integrations. Do you start your transaction leading to a crash directly after initializing the SDK a second time? I tried to reproduce your problem by starting the SDK, stopping it, starting it again and then directly capturing a transaction, but it works flawlessly.

is that we actually start sentry on boot, then we stop it after the bootstrapper and then call start agains straight after. I can't remember exactly why we do this, but I wonder if this could also be causing issues.

Can you maybe try not stopping and starting the SDK and check if the crash still occurs? If it disappears, it would make it easier to identify the problem.

@nebsta
Copy link
Author

nebsta commented Oct 2, 2024

@philipphofmann

Do you start your transaction leading to a crash directly after initialising the SDK a second time?

The finish on the transaction that triggers the crash is before we stop and start the system.

Can you maybe try not stopping and starting the SDK and check if the crash still occurs? If it disappears, it would make it easier to identify the problem.

I'll give this a go today and get back to you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Status: Backlog
Development

No branches or pull requests

6 participants