fix(celery): close celery.apply spans even without after_task_publish, when using apply_async #10676

Merged: wantsui merged 28 commits into main from close-apply-async-celery-spans on Oct 1, 2024

@wantsui (Collaborator) commented Sep 16, 2024

The Celery integration relies on several Celery signals to start and finish the span when apply_async is called.

If an expected signal doesn't fire, the span is never finished, which can break context propagation and produce unexpected traces.

Example:

  • dd-trace-py expects the before_task_publish signal to start the span and the after_task_publish signal to close it. If after_task_publish is never sent (which can happen if a Celery exception occurs while processing the app), the span won't finish.
  • The same can happen with task_prerun and task_postrun.
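
The failure mode in the bullets above can be reproduced with a small stand-alone simulation. This is only a sketch: `Span`, `open_spans`, `on_before_publish`, `on_after_publish`, and `publish` are hypothetical stand-ins, not dd-trace-py or Celery APIs.

```python
class Span:
    """Hypothetical stand-in for a tracer span."""

    def __init__(self, name):
        self.name = name
        self.finished = False

    def finish(self):
        self.finished = True


# Spans opened by the "before" handler, keyed by task id.
open_spans = {}


def on_before_publish(task_id):
    # Plays the role of a before_task_publish handler: start the span.
    open_spans[task_id] = Span("celery.apply")


def on_after_publish(task_id):
    # Plays the role of an after_task_publish handler: finish the span.
    span = open_spans.pop(task_id, None)
    if span is not None:
        span.finish()


def publish(task_id, fail=False):
    """Simulate publishing a task; the 'after' signal is skipped on failure."""
    on_before_publish(task_id)
    if fail:
        raise RuntimeError("broker unavailable")
    on_after_publish(task_id)


publish("ok-task")
try:
    publish("bad-task", fail=True)
except RuntimeError:
    pass

# The span for the failed publish was never finished.
leaked = [tid for tid, span in open_spans.items() if not span.finished]
print(leaked)  # ['bad-task']
```

Because nothing else owns the span once the closing signal is skipped, it stays open until something outside the signal handlers finishes it.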

Solution

This PR patches apply_async so that it checks for a lingering span and closes it when apply_task is called.

If an internal exception happens, the error will be marked on the celery.apply span.
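
The fail-safe described here can be sketched as a wrapper around the publish call. This is a minimal illustration under stated assumptions, not the code in this PR: `Span`, `open_spans`, `traced_apply_async`, and `failing_publish` are hypothetical names, and only the debug message text is taken from the description.

```python
import logging

log = logging.getLogger("celery_span_sketch")  # hypothetical logger name


class Span:
    """Hypothetical stand-in for a tracer span."""

    def __init__(self, name):
        self.name = name
        self.finished = False
        self.error = None

    def set_error(self, exc):
        self.error = exc

    def finish(self):
        self.finished = True


open_spans = {}  # task id -> span still waiting for its closing signal
created = []  # every span created, kept only so the result can be inspected


def traced_apply_async(publish, task_id):
    """Call publish(task_id), then close any span the signals left open."""
    try:
        return publish(task_id)
    except Exception as exc:
        span = open_spans.get(task_id)
        if span is not None:
            span.set_error(exc)  # mark the error on the celery.apply span
        raise
    finally:
        span = open_spans.pop(task_id, None)
        if span is not None and not span.finished:
            log.debug("The after_task_publish signal was not called, so manually closing span")
            span.finish()


def failing_publish(task_id):
    # The "before" signal fired and opened a span, then publishing failed,
    # so the "after" signal never runs.
    span = Span("celery.apply")
    created.append(span)
    open_spans[task_id] = span
    raise RuntimeError("broker unavailable")


try:
    traced_apply_async(failing_publish, "task-1")
except RuntimeError:
    pass

span = created[0]
print(span.finished, type(span.error).__name__, open_spans)
```

Doing the cleanup in `finally` means the span is closed whether the call returns or raises, and the original exception still propagates to the caller.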

To track this, I added new logs in debug mode:

The after_task_publish signal was not called, so manually closing span

and

The task_postrun signal was not called, so manually closing span
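
To check locally whether the fail-safe is firing, debug logging has to be enabled and the records filtered for these messages. A minimal sketch using only the standard `logging` module, assuming the integration logs under the `ddtrace` logger hierarchy. The child logger name below is illustrative, and the message is emitted by hand so the snippet runs without ddtrace installed.

```python
import logging


class ManualCloseCollector(logging.Handler):
    """Collect debug records that report a manually closed span."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.messages = []

    def emit(self, record):
        message = record.getMessage()
        if "so manually closing span" in message:
            self.messages.append(message)


collector = ManualCloseCollector()
ddtrace_logger = logging.getLogger("ddtrace")  # assumed parent logger name
ddtrace_logger.setLevel(logging.DEBUG)
ddtrace_logger.addHandler(collector)

# Simulated emission from a child logger; in a real app the integration
# would produce this record itself.
child = logging.getLogger("ddtrace.contrib.celery")  # illustrative name
child.debug("The task_postrun signal was not called, so manually closing span")
child.debug("unrelated message")

print(collector.messages)
```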

A related PR, #10848, improves how we extract information based on the task protocol, which also affects whether spans get closed.

Special Thanks:

  • Thanks to @tabgok for going through this with me in great detail!
  • @timmc-edx for helping us track it down!

APMS-13158

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@wantsui requested a review from tabgok September 16, 2024 21:13

datadog-dd-trace-py-rkomorn bot commented Sep 16, 2024

Datadog Report

Branch report: close-apply-async-celery-spans
Commit report: 81b2eb9
Test service: dd-trace-py

✅ 0 Failed, 820 Passed, 376 Skipped, 21m 13.4s Total duration (15m 15.32s time saved)

github-actions bot (Contributor) commented Sep 16, 2024

CODEOWNERS have been resolved as:

releasenotes/notes/fix-celery-apply-async-span-close-b7a8db188459f5b5.yaml  @DataDog/apm-python
ddtrace/contrib/internal/celery/app.py                                  @DataDog/apm-core-python @DataDog/apm-idm-python
ddtrace/contrib/internal/celery/signals.py                              @DataDog/apm-core-python @DataDog/apm-idm-python
tests/contrib/celery/test_integration.py                                @DataDog/apm-core-python @DataDog/apm-idm-python

pr-commenter bot commented Sep 16, 2024

Benchmarks

Benchmark execution time: 2024-10-01 20:02:33

Comparing candidate commit 81b2eb9 in PR branch close-apply-async-celery-spans with baseline commit 7f6f3db in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 371 metrics, 53 unstable metrics.

@wantsui changed the title from "fix(celery) Close celery.apply spans when using apply_async, even when after_task_publish isn't called" to "fix(celery): Close celery.apply spans even without after_task_publish, when using apply_async" on Sep 17, 2024
@wantsui changed the title from "fix(celery): Close celery.apply spans even without after_task_publish, when using apply_async" to "fix(celery): close celery.apply spans even without after_task_publish, when using apply_async" on Sep 17, 2024
@wantsui marked this pull request as ready for review September 20, 2024 15:50
@wantsui requested review from a team as code owners September 20, 2024 15:50
wantsui added a commit that referenced this pull request Sep 30, 2024
This PR fixes an issue where Celery's closing signals got triggered but
dd-trace-py skipped closing the `celery.apply` span due to not finding
the task id.

In Celery's `task_protocol: 1`, the id is in the message body:
https://docs.celeryq.dev/en/main/internals/protocol.html#message-body .

The issue with the previous logic is that if the headers contained any
information (even information unrelated to the id), it would skip the
check for the id in the body:

before:

```
if headers:
```

after (this PR): 

```
if headers and 'id' in headers:
```

By doing this, we check the headers for the id, then check the body for
the id.

If the task id is in neither the headers nor the body, the code still
hits the debug log `unable to extract the Task and the task_id. This
version of Celery may not be supported.`
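
The lookup order described above can be sketched as follows. This is an illustration rather than the library's code: `extract_task_id` and `extract_task_id_old` are hypothetical names, and the "old" variant is a reconstruction from the `if headers:` condition quoted above.

```python
def extract_task_id(headers, body):
    """Return the task id from the headers, else the body, else None."""
    if headers and "id" in headers:
        # Task protocol 2 carries the id in the message headers.
        return headers["id"]
    if body and "id" in body:
        # Task protocol 1 carries the id in the message body.
        return body["id"]
    return None


def extract_task_id_old(headers, body):
    """Assumed previous behaviour: non-empty headers skip the body check."""
    if headers:
        return headers.get("id")
    if body:
        return body.get("id")
    return None


# Protocol 1 style message whose headers hold only unrelated data.
headers = {"x-custom": "value"}
body = {"id": "abc-123", "task": "tasks.add"}

print(extract_task_id_old(headers, body))  # None: the body is never consulted
print(extract_task_id(headers, body))  # abc-123
```

When both lookups return nothing, the caller is the place to emit the "unable to extract" debug log and skip closing the span.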

This PR relates to the goal of #10676: closing Celery spans. If for some
reason the logic in this PR fails to close an open `celery.apply` span,
#10676 will act as a fail-safe and close it.

Special Thanks: @timmc-edx for helping us track this down!

github-actions bot pushed a commit that referenced this pull request Sep 30, 2024
(cherry picked from commit 6346fcb)
github-actions bot pushed a commit that referenced this pull request Sep 30, 2024
(cherry picked from commit 6346fcb)
github-actions bot pushed a commit that referenced this pull request Sep 30, 2024
(cherry picked from commit 6346fcb)
wantsui added a commit that referenced this pull request Oct 1, 2024
… 2.14] (#10874)

Backport 6346fcb from #10848 to 2.14.

Co-authored-by: wantsui <[email protected]>
@wantsui merged commit 0d28e08 into main Oct 1, 2024
576 checks passed
@wantsui deleted the close-apply-async-celery-spans branch October 1, 2024 20:04
github-actions bot pushed a commit that referenced this pull request Oct 1, 2024
…sh, when using apply_async (#10676)

The Celery integration relies on several [Celery
signals](https://docs.celeryq.dev/en/stable/userguide/signals.html) to
start and finish the span when `apply_async` is called.

The integration can fail if the expected signals don't trigger, which
can lead to broken context propagation (and unexpected traces).

**Example:**
- dd-trace-py expects the signal `before_task_publish` to start the span
then `after_task_publish` to close the span. If the `after_task_publish`
signal never gets called (which can happen if a Celery exception occurs
while processing the app), then the span won't finish.
- The same thing above can also happen to `task_prerun` and
`task_postrun`.

**Solution**

This PR patches `apply_async` so that there is a check to see if there
is a span lingering around and closes it when `apply_task` is called.

If an internal exception happens, the error will be marked on the
`celery.apply` span.

To track this, I added new logs in debug mode:
> The after_task_publish signal was not called, so manually closing span

and
> The task_postrun signal was not called, so manually closing span

A related PR, #10848, improves how we extract information based on the
task protocol, which also affects whether spans get closed.

Special Thanks:
- Thanks to @tabgok for going through this with me in great detail!
- @timmc-edx for helping us track it down!

[APMS-13158]

[APMS-13158]:
https://datadoghq.atlassian.net/browse/APMS-13158?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: Emmett Butler <[email protected]>
(cherry picked from commit 0d28e08)
github-actions bot pushed a commit that referenced this pull request Oct 1, 2024
…sh, when using apply_async (#10676)

Co-authored-by: Emmett Butler <[email protected]>
(cherry picked from commit 0d28e08)
github-actions bot pushed a commit that referenced this pull request Oct 1, 2024
…sh, when using apply_async (#10676)

Co-authored-by: Emmett Butler <[email protected]>
(cherry picked from commit 0d28e08)
wantsui added a commit that referenced this pull request Oct 1, 2024
… 2.12] (#10872)

Backport 6346fcb from #10848 to 2.12.

Co-authored-by: wantsui <[email protected]>
wantsui added a commit that referenced this pull request Oct 1, 2024
… 2.13] (#10873)

Backport 6346fcb from #10848 to 2.13.

Co-authored-by: wantsui <[email protected]>
wantsui added a commit that referenced this pull request Oct 1, 2024
…sh, when using apply_async [backport 2.14] (#10893)

Backport 0d28e08 from #10676 to 2.14.

The instrumentation for the Celery integration relies on various [Celery
signals](https://docs.celeryq.dev/en/stable/userguide/signals.html) to
start and end the span when `apply_async` is called.

The integration can fail if the expected signals don't trigger, which
can lead to broken context propagation (and unexpected traces).

**Example:**
- dd-trace-py expects the `before_task_publish` signal to start the span
and the `after_task_publish` signal to close it. If `after_task_publish`
never gets called (which can happen if a Celery exception occurs while
processing the app), the span won't finish.
- The same can happen with the `task_prerun` and `task_postrun` pair.

**Solution**

This PR patches `apply_async` to check whether a span is still lingering
and to close it when `apply_task` is called.

If an internal exception happens, the error will be marked on the
`celery.apply` span.

To track this, I added new logs in debug mode:
> The after_task_publish signal was not called, so manually closing span

and 
> The task_postrun signal was not called, so manually closing span


A related PR, #10848, improves how we extract information based on the
task protocol, which also affects whether spans get closed.
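
The fallback described under **Solution** can be sketched roughly as follows.
This is not the code from this PR: `FakeSpan`, `open_spans`, and
`close_lingering_spans` are hypothetical stand-ins, and the sketch assumes the
signal handlers leave any still-open span in a shared dict.

```python
import logging
import sys

log = logging.getLogger(__name__)


class FakeSpan:
    """Stand-in for a tracer span; only what the sketch needs."""

    def __init__(self, name):
        self.name = name
        self.finished = False
        self.exc_info = None

    def set_exc_info(self, exc_type, exc_val, exc_tb):
        self.exc_info = (exc_type, exc_val, exc_tb)

    def finish(self):
        self.finished = True


# Hypothetical storage where the "start" signal handlers put open spans.
open_spans = {}


def close_lingering_spans(func):
    """Wrap an apply_async-like callable so open spans always get finished."""

    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Mark the internal error on the span, then re-raise as usual.
            span = open_spans.get("task_span")
            if span is not None:
                span.set_exc_info(*sys.exc_info())
            raise
        finally:
            # Any span still present here was never closed by its signal.
            pairs = (
                ("task_span", "after_task_publish"),
                ("prerun_span", "task_postrun"),
            )
            for key, signal in pairs:
                span = open_spans.pop(key, None)
                if span is not None:
                    log.debug(
                        "The %s signal was not called, so manually closing span",
                        signal,
                    )
                    span.finish()

    return wrapper
```

Doing the cleanup in `finally` means the span is closed whether the wrapped
call returns normally or raises, which is the case where the closing signal
tends to be skipped.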

Special Thanks:
- Thanks to @tabgok for going through this with me in great detail!
- @timmc-edx for helping us track it down!

[APMS-13158]

[APMS-13158]:
https://datadoghq.atlassian.net/browse/APMS-13158?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: wantsui <[email protected]>
wantsui added a commit that referenced this pull request Oct 1, 2024
…sh, when using apply_async [backport 2.13] (#10892)

Backport 0d28e08 from #10676 to 2.13.

Co-authored-by: wantsui <[email protected]>
Co-authored-by: erikayasuda <[email protected]>
wantsui added a commit that referenced this pull request Oct 1, 2024
…sh, when using apply_async [backport 2.12] (#10891)

Backport 0d28e08 from #10676 to 2.12.

Co-authored-by: wantsui <[email protected]>
Co-authored-by: erikayasuda <[email protected]>