fix(core): add level of indirection for provider.py contextvars #10525

Open
wants to merge 26 commits into
base: main

Conversation

sanchda
Contributor

@sanchda sanchda commented Sep 5, 2024

Whenever a contextvar is reassociated, it causes the underlying HAMT data structure to clone a node. This clone operation requires dereferencing stored Python objects, which can cause segmentation faults if other libraries mismanage the reference counts of their objects and cause them to be GC'd prematurely.

This patch stores a single wrapper object in the contextvar, then updates a reference inside that wrapper to propagate the information we need. In our normal testing fixture, we cause as many as 69 realloc (and clone) events in a single process (I determined this by patching CPython itself to produce a log). With this patch, that number is down to 1, and it doesn't originate from provider.py.

I have a standalone reproduction for the noted behavior here. The repro isn't very clever about how it manages the lifetimes of GC'd objects--the issues we see in the wild are a little bit more subtle, since they don't segfault during normal scope cleanup (unlike mine).

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Contributor

github-actions bot commented Sep 5, 2024

CODEOWNERS have been resolved as:

releasenotes/notes/fix-contextvar-cloning-49adaf7fdf36e8fb.yaml         @DataDog/apm-python
ddtrace/_trace/provider.py                                              @DataDog/apm-sdk-api-python

@emmettbutler
Collaborator

interesting and promising

@pr-commenter

pr-commenter bot commented Sep 5, 2024

Benchmarks

Benchmark execution time: 2024-09-19 16:40:04

Comparing candidate commit 746d6ce in PR branch sanchda/make_contextvars_indirect with baseline commit 742a579 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 353 metrics, 47 unstable metrics.

@taegyunkim
Contributor

Could you run `hatch run lint:fmt` to format the code and trigger the rest of CircleCI?

@datadog-dd-trace-py-rkomorn

datadog-dd-trace-py-rkomorn bot commented Sep 5, 2024

Datadog Report

Branch report: sanchda/make_contextvars_indirect
Commit report: 746d6ce
Test service: dd-trace-py

✅ 0 Failed, 1113 Passed, 173 Skipped, 29m 51.39s Total duration (7m 10.65s time saved)

Contributor

@Yun-Kim Yun-Kim left a comment

Just to help my understanding, would you be able to update the PR description with a small snippet example of a reassoc event that would trigger a segfault/corruption? Thanks!

@sanchda
Contributor Author

sanchda commented Sep 6, 2024

Just to help my understanding, would you be able to update the PR description with a small snippet example of a reassoc event that would trigger a segfault/corruption? Thanks!

Unfortunately, any ContextVar.set() operation will trigger a re-association if the given context already exists. The sad truth of the matter is that CPython's implementation of ContextVars bleeds a lot of common state together. I have a standalone reproduction here. I'll link it in the PR description.
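
The behaviour described above can be observed indirectly from pure Python. PEP 567 specifies that a Context is backed by an immutable mapping, so `ContextVar.set()` has to produce a new mapping rather than edit the existing one in place. A minimal sketch (the names `var` and `demo` are made up for illustration):

```python
import contextvars

var = contextvars.ContextVar("var")


def demo():
    """Show that a copied context keeps the old binding after a later set()."""
    var.set(1)
    # copy_context() captures the current (immutable) mapping.
    snapshot = contextvars.copy_context()
    # This set() rebinds `var` in the current context by building a new mapping.
    var.set(2)
    return snapshot[var], var.get()
```

The snapshot still sees the first value, which is only possible if the second `set()` left the earlier mapping intact.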

@r1viollet
Contributor

I agree with the code, but I still cannot understand why I get a different type of context within greenlets.
I now get an actual context (and not a span). I am not sure what the expected behaviour is.

@Yun-Kim Yun-Kim changed the title from "fix(core): add level of indirection for provider.py contetvars" to "fix(core): add level of indirection for provider.py contextvars" Sep 9, 2024
Contributor

@r1viollet r1viollet left a comment

LGTM

Contributor

@Yun-Kim Yun-Kim left a comment

This PR makes sense to me but I must admit I'm not nuanced in contextvars internals enough to give a more thoughtful review here. Is this not an issue with CPython's implementation of ContextVars, and if so, should it be fixed upstream?

@r1viollet
Contributor

This PR makes sense to me but I must admit I'm not nuanced in contextvars internals enough to give a more thoughtful review here. Is this not an issue with CPython's implementation of ContextVars, and if so, should it be fixed upstream?

I agree there could be an effort to understand what objects have lifetime issues. Though we still have to mitigate in the short term.

@sanchda
Contributor Author

sanchda commented Sep 19, 2024

I dropped this PR for a while because we shipped a patch to some affected customers and it didn't directly fix the issue we saw. However, it did reveal more information about where bad objects were being consumed. I count that as a net win.

So, I'd like to summarize the benefits of this PR as follows:

  • Fewer clone operations means better performance
  • Protects us from CPython global state
  • ... and in so doing, makes crashes more actionable

@Yun-Kim:

This PR makes sense to me but I must admit I'm not nuanced in contextvars internals enough to give a more thoughtful review here. Is this not an issue with CPython's implementation of ContextVars, and if so, should it be fixed upstream?

I don't know whether I'd call it an issue in the implementation or the design. There are a few problems here.

  • CPython directly uses the system allocator--it doesn't manage objects through the kind of indirection that the JVM does, so it's entirely possible for misbehaving code to convert a strong ref (e.g., a contextvar) into a weak ref, causing issues to propagate. Solving this problem in a durable way would be extremely difficult.
  • Even if the CPython maintainers did treat this as an implementation issue, they will not backport such a fix to versions which we currently support (e.g., 3.9). That means we're on the hook for dealing with those versions safely. I think this PR does that.
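
To illustrate what a "strong ref" means in the first point: storing an object in a ContextVar makes the context hold a counted reference to it. The sketch below only shows that reference being taken. It does not reproduce the misbehaviour, which needs native code that decrements a count it does not own. `Payload` and `refcount_delta` are hypothetical names, and `sys.getrefcount` values are specific to CPython.

```python
import contextvars
import sys


class Payload:
    """Hypothetical stand-in for an object stored in a context variable."""


def refcount_delta():
    """Return how much the reference count grows when the object is stored."""
    obj = Payload()
    var = contextvars.ContextVar("refcount_sketch")
    before = sys.getrefcount(obj)
    # The context now owns a reference to `obj` in addition to the local name.
    var.set(obj)
    after = sys.getrefcount(obj)
    return after - before
```

If some other code drops that extra count without owning it, the object can be freed while the context still points at it, which is the situation the clone operation then trips over.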


5 participants