Disabling NR APM causes trace concatenation in Datadog #692
Testing this in stage and edge LMS. See edx/edx-arch-experiments#692
Currently, we're investigating if using a NR Free Tier account for edxapp is enough to get DD traces working. Other possibilities may include trying to get tracing (or APM) disabled everywhere in Edge. This includes where Spans were found in the last day:
|
…tation) (#1) If the Django setting `EDX_NEWRELIC_NO_REPORT` is present and enabled, the agent will not talk to New Relic's servers and will instead use a set of previously captured responses from our sandbox account. Instrumentation (tracing, etc.) will still be in place, but the data will be discarded rather than being reported. See edx/edx-arch-experiments#692
[idea] We might want 3 modes for our hacked NR agent:
|
See #692
Testing setup: https://2u-internal.atlassian.net/wiki/spaces/ENG/pages/1173618788/Running+Datadog+in+devstack
And then in lms-shell:
```
make requirements
pip install ddtrace
pip install -e /edx/src/archexp/
./wrap-datadog.sh ./server.sh
```
Expect to see this log message: `Attached MissingSpanProccessor for Datadog diagnostics`
NOTE: This prints "Spans created = 0; spans finished = 0" in devstack when shut down with ctrl-c, but not when restarted due to autoreload (where it prints correct info). Something is initializing Django twice, and one span processor is getting span info while the other is printing at shutdown. There's more to debug here, but it seems stable enough to at least try deploying it.
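The counting behavior described in the note can be sketched roughly as follows. This is a minimal illustration, not the actual edx-arch-experiments code: the class name `SpanCountingProcessor`, the `summary()` helper, and the hook names `on_span_start`, `on_span_finish`, and `shutdown` are assumptions about how a tracer might call into such a processor.

```python
import logging

log = logging.getLogger(__name__)


class SpanCountingProcessor:
    """Hypothetical stand-in for a diagnostics span processor.

    Assumes the tracer calls on_span_start/on_span_finish for each span and
    shutdown() once at exit. These hook names are assumptions for this sketch.
    """

    def __init__(self):
        self.spans_started = 0
        self.spans_finished = 0
        self.open_spans = {}  # span_id -> span, for spans not yet finished

    def on_span_start(self, span):
        self.spans_started += 1
        self.open_spans[span.span_id] = span

    def on_span_finish(self, span):
        self.spans_finished += 1
        self.open_spans.pop(span.span_id, None)

    def summary(self):
        return (
            f"Spans created = {self.spans_started}; "
            f"spans finished = {self.spans_finished}"
        )

    def shutdown(self, _timeout=None):
        # Counters are per-instance, so an instance that never received
        # any spans reports zeros here.
        log.info(self.summary())
        for span in self.open_spans.values():
            log.info("Span created but not finished: %r", span)
```

Because the counters live on the instance, two processors created by a double initialization would report independently. If the one printing at shutdown is not the one receiving spans, it would log zero counts, which is consistent with the devstack behavior described in the note.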
Adds logging diagnostics for traces in Datadog. See #692
No longer needed for edx/edx-arch-experiments#692
- Convert `/heartbeat` view into a celery test
- Send Celery tasks to a broker, rather than running in-process
- Hardcode a broker URL
- Log all celery signals
See edx/edx-arch-experiments#692
- Add `/celery_repro` URL to run a sample task
- Send Celery tasks to a broker, rather than running in-process
- Hardcode a broker URL
- Log all celery signals
See edx/edx-arch-experiments#692
Introduces `EDXAPP_NEWRELIC_ENABLE` and sets it to false so edxapp is no longer drawing from the common `COMMON_ENABLE_NEWRELIC_APP` variable. This is now possible thanks to fixes in ddtrace 2.14.2. See edx/edx-arch-experiments#692
…le (#74) Introduces `EDXAPP_NEWRELIC_ENABLE` and sets it to false so edxapp is no longer drawing from the common `COMMON_ENABLE_NEWRELIC_APP` variable. This is now possible thanks to fixes in ddtrace 2.14.2. See edx/edx-arch-experiments#692
We'll leave the datadog_diagnostics cleanup task for about a week, so cleanup can happen on 2024-10-14 or later. Moving to blocked until then.
These are no longer in use in edxapp as of: - edx/edx-internal#11806 (prod, stage) - edx/edge-internal#790 (edge) Also remove testing dependencies that were only in use by this app. See #692 for merge order.
Ultimately, this ticket is for disabling New Relic APM across edxapp. We ran into trace-related issues in DD when first attempting to disable NR APM. We later caused the same issue in Edge simply by disabling NR Tracing.
This bug has been observed in edxapp (LMS and CMS), enterprise-catalog, and registrar. It can be identified by searching for spans matching `operation_name:django.request -@_top_level:*`.
Acceptance criteria
Things we have already tried
These should be checked off once they have already been either reverted or made permanent:
- … `DD_DJANGO_INSTRUMENT_MIDDLEWARE` to reduce the noise when debugging huge traces.
- … `DD_TRACE_HEADER_TAGS` to debug tracing headers
- … `operation_name:django.request` on All Spans since service entry spans were unreliable.
- … (`EDXAPP_NEWRELIC_LICENSE_TEST_FREE`) and removed from AWS secrets manager
- … `DD_TRACE_PROPAGATION_STYLE_EXTRACT=none`
- … `service:edx-edxapp-* dirname:"/edx/var/log/supervisor" "[edx_arch_experiments.datadog_diagnostics.middleware]"`
- … (`EXTRA_MIDDLEWARE_CLASSES` Django setting) [stage and prod, edge]
- … (`DATADOG_DIAGNOSTICS_`) -- if it's just `DATADOG_DIAGNOSTICS_ENABLE` it can be merged in any order, as it's just controlling noisy logs we don't have any more. [prod LMS was only instance]
- … (`datadog.diagnostics.`) -- merge in any order, as these turn features on
- … `DD_TRACE_CELERY_ENABLED=false`, because some of the request spans in anomalous traces have missing parent spans that were celery-related.
- … `DATADOG_DIAGNOSTICS_CELERY_LOG_SIGNALS` (using edx-arch-experiments 4.3.0)
- … (`EDXAPP_DDTRACE_PIP_SPEC`) that closes celery spans using a fallback
- … `EDXAPP_DDTRACE_PIP_SPEC`
Details
When we disabled NR APM in edxapp on June 6, we observed two anomalies with traces:
- … `service:edx-edxapp-lms env:prod` dropped precipitously by 2-3x.
However, we believe the actual traffic was unchanged. This is corroborated by the Django hit metrics remaining steady, as seen in the Service Catalog. We cannot find any relevant code or config changes that would have been deployed around that time.
Our current understanding is that the majority of Django web requests that are traced are not recorded as service entry spans, but are instead parented to a different trace. This causes several problems:
We can also reproduce this issue by setting "Tracing type: None" in the application settings in NR (usually set to Distributed Tracing).
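The symptom described in this section (`django.request` spans that are not recorded as service entry spans) can be sketched as a check over span records. This is only an illustration of the search `operation_name:django.request -@_top_level:*`; the dict shape used here (`operation_name` plus an `attributes` mapping) and both function names are hypothetical and do not represent a Datadog export format.

```python
def is_anomalous_request_span(span):
    """True for a django.request span that lacks the _top_level attribute.

    `span` is a hypothetical dict: {"operation_name": str, "attributes": dict}.
    """
    return (
        span.get("operation_name") == "django.request"
        and "_top_level" not in span.get("attributes", {})
    )


def anomalous_fraction(spans):
    """Fraction of django.request spans that are not marked top-level."""
    requests = [s for s in spans if s.get("operation_name") == "django.request"]
    if not requests:
        return 0.0
    return sum(is_anomalous_request_span(s) for s in requests) / len(requests)
```

In a healthy service one would expect this fraction to be near zero, since each traced web request should normally be a service entry span; a large fraction would match the behavior described above.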
Links