Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document potential crashes when using SDK methods from background threads #116

Closed
RuiAAPeres opened this issue Nov 2, 2024 · 5 comments · Fixed by #128
Closed

Document potential crashes when using SDK methods from background threads #116

RuiAAPeres opened this issue Nov 2, 2024 · 5 comments · Fixed by #128

Comments

@RuiAAPeres
Copy link

RuiAAPeres commented Nov 2, 2024

SDK version: 6.4.2

Context

The SDK enforces main thread usage for initialization (setup and start), but other methods like add(event:) and logging don't have the same enforcement despite being unsafe for background thread usage.

Issue

While the documentation notes that using the SDK from background threads may lead to data inconsistency, testing shows it can actually cause crashes due to concurrent dictionary access:

func test_backgroundThreadUsageCrashes() {
    var sharedSpan = SpanData(
        traceId: .random(),
        spanId: .random(),
        name: "test",
        kind: .internal,
        startTime: Date(),
        attributes: ["emb.type": .string("sys.breadcrumb")],
        endTime: Date()
    )
    
    DispatchQueue.concurrentPerform(iterations: 100) { _ in
        _ = sharedSpan.attributes
        
        var attrs = sharedSpan.attributes
        attrs["test"] = .string("value")
    }
}

This crashes with:

Thread X: EXC_BAD_ACCESS

And a more complex test:

func test_aggressiveConcurrentAccess_shouldCrash() {
    var sharedSpan = SpanData(
        traceId: .random(),
        spanId: .random(),
        name: "test-span",
        kind: .internal,
        startTime: Date(),
        attributes: ["emb.type": .string("sys.breadcrumb")],
        endTime: Date()
    )

    let queues = (0..<5).map { i in
        DispatchQueue(label: "test.queue.\(i)", attributes: .concurrent)
    }
    let group = DispatchGroup()

    for i in 0..<100 {
        group.enter()
        queues[i % queues.count].async {
            _ = sharedSpan.embType
            group.leave()
        }

        group.enter()
        queues[(i + 1) % queues.count].async {
            sharedSpan = sharedSpan.settingAttributes([
                "emb.type": .string("perf.test_\(i)")
            ])
            group.leave()
        }

        group.enter()
        queues[(i + 2) % queues.count].async {
            let breadcrumb = Breadcrumb(
                message: "Message \(i)",
                attributes: [:]
            )
            sharedSpan = sharedSpan.settingAttributes(breadcrumb.attributes)
            group.leave()
        }

        group.enter()
        queues[(i + 3) % queues.count].async {
            let validator = LengthOfNameValidator()
            _ = validator.validate(data: &sharedSpan)
            group.leave()
        }
    }

    group.wait()
}

The above test can crash in multiple places, which furthers illustrates the problem.

Suggestion

Consider either:

  • Adding thread assertions to all public methods (like already done for setup and start).
  • Or updating documentation to warn about potential crashes, not just data inconsistency.

The SDK already has the infrastructure for thread checking (e.g., EmbraceSetupError.invalidThread), it could be extended to other public methods.

Mitigation

For other users that might find this issue. I am now enforcing access to the Embrace SDK on the main thread:

/// `client` is a property of type `Embrace?`
private func onMainThread(_ block: @escaping (Embrace) -> Void) {
    DispatchQueue.main.async {[client] in
        guard let client = client else {
            return
        }

        block(client)
    }
 }

func logBreadcrumb(_ message: String, properties: [String: String]?) {
    onMainThread { client in
       client.add(event: .breadcrumb(message, properties: properties ?? [:]))
    }
}
@ArielDemarco
Copy link
Collaborator

ArielDemarco commented Nov 4, 2024

Hey @RuiAAPeres thanks for submitting this issue; seems like the problem under the hood is that SpanData (from OpenTelemetrySdk provides access to different properties and most of them are not thread safe as far as I can see.
Probably, aside of documenting this, we'll probably have to create an anti-corruption layer to prevent this from happening, at least, from the public API.

Edit: Probably the suggestion/mitigation might help in most cases of our public API, but if somebody interacts with OpenTelemetry structures directly (like you do in both tests) crashes like this will still happen as those structures are not Thread Safe

@RuiAAPeres
Copy link
Author

RuiAAPeres commented Nov 5, 2024

but if somebody interacts with OpenTelemetry structures directly (like you do in both tests) crashes like this will still happen as those structures are not Thread Safe

In our app, all interactions with OpenTelemetry are indirectly via the Embrace SDK. The tests above serve as minimum replicable case and just used to narrow down the issue.

I would recommend sending an email to your existing customers warning about this situation. It's a pretty serious problem: the SDK used to report crashes is crashing.

@ArielDemarco
Copy link
Collaborator

Thank you for taking the time to create the tests; they are really useful for replicating the issue. Obviously, we will be notifying through various channels and updating the documentation/changelog accordingly.

That being said, we're somewhat limited in how much we can prevent some of these cases when users directly use OpenTelemetry structures. While we can (and we will) improve our APIs to enforce access from a single thread, if any consumer imports OpenTelemetrySdk and performs operations like the ones you've shown in the test, it's likely to crash. I understand that this isn't a normal/expected behavior, which is why I've created an issue in the OTel repository.

Next steps: in the next release, we will be preventing some of these issues, and hopefully, if improvements are also made in OpenTelemetry, we might be able to prevent any type of crash related to concurrency.

@ArielDemarco
Copy link
Collaborator

Made a contribution on the OTel repository to fix the issue that's causing this. A more detailed explanation of the underlying problem is on the issue I mentioned before.
This should be fixed in the next release of the SDK

@RuiAAPeres
Copy link
Author

Nice! Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants