
[processor/tailsampling] Fixed sampling decision metrics #37212

Open · wants to merge 14 commits into main

Conversation

@yvrhdn (Contributor) commented Jan 14, 2025

Description

Fixes some of the metrics emitted from sampling decisions. I believe otelcol_processor_tail_sampling_sampling_trace_dropped_too_early and otelcol_processor_tail_sampling_sampling_policy_evaluation_error_total are sometimes overcounted.

The bug: samplingPolicyOnTick creates a policyMetrics struct to hold several counters, and this struct is shared across all the traces evaluated during that tick.
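
A sketch of that struct (field names inferred from the metric calls below, so the real definition may differ slightly):

```go
// policyMetrics accumulates per-tick counters for the sampling decision loop.
type policyMetrics struct {
	idNotFoundOnMapCount, evaluateErrorCount int64
}
```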

On each iteration of the decision loop, the counter values are added to the metrics:

```go
tsp.telemetry.ProcessorTailSamplingSamplingDecisionTimerLatency.Record(tsp.ctx, int64(time.Since(startTime)/time.Microsecond))
tsp.telemetry.ProcessorTailSamplingSamplingTraceDroppedTooEarly.Add(tsp.ctx, metrics.idNotFoundOnMapCount)
tsp.telemetry.ProcessorTailSamplingSamplingPolicyEvaluationError.Add(tsp.ctx, metrics.evaluateErrorCount)
tsp.telemetry.ProcessorTailSamplingSamplingTracesOnMemory.Record(tsp.ctx, int64(tsp.numTracesOnMap.Load()))
tsp.telemetry.ProcessorTailSamplingGlobalCountTracesSampled.Add(tsp.ctx, 1, decisionToAttribute[decision])
```

But the counters are not reset between iterations, so if the first evaluated trace could not be found on the map, idNotFoundOnMapCount is set to 1 and every subsequent iteration adds another 1 to the otelcol_processor_tail_sampling_sampling_trace_dropped_too_early metric, even though those traces were found.

I've moved the metric updates outside of the for loop so the counters are added only once per tick.
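
To make the overcounting concrete, here is a minimal standalone sketch (not the processor code; idNotFound stands in for metrics.idNotFoundOnMapCount):

```go
package main

import "fmt"

func main() {
	var idNotFound int64 // shared across the whole tick, like policyMetrics
	var recorded int64   // what the metric instrument receives in total

	traces := []bool{false, true, true} // only the first trace is missing from the map
	for _, found := range traces {
		if !found {
			idNotFound++
		}
		// Buggy pattern: adding the running counter on every iteration
		// records 1 + 1 + 1 = 3 drops instead of 1.
		recorded += idNotFound
	}
	fmt.Println("recorded inside the loop:", recorded) // 3

	// Fixed pattern: add the counter once, after the loop.
	fmt.Println("recorded after the loop:", idNotFound) // 1
}
```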

Testing

I have added a dedicated test for each metric, processing multiple traces in one tick.
I've added a test for otelcol_processor_tail_sampling_sampling_trace_dropped_too_early.
I can add one for sampling_policy_evaluation_error too, I'm just not sure how to deliberately fail a policy.

@yvrhdn yvrhdn requested review from jpkrohling and a team as code owners January 14, 2025 13:05
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jan 14, 2025
@portertech (Contributor) commented:

@yvrhdn perhaps an OTTL policy with an invalid condition would work to test sampling_policy_evaluation_error?

@yvrhdn (Contributor, Author) commented Jan 16, 2025

> @yvrhdn perhaps an OTTL policy with an invalid condition would work to test sampling_policy_evaluation_error?

Cool, I've added a test for sampling_policy_evaluation_error as well 🙂
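
For reference, one way to force an evaluation error in a test is a stub evaluator that always fails. This is a hypothetical sketch, assuming the internal sampling.PolicyEvaluator interface shape; the actual test may instead use the invalid OTTL condition suggested above:

```go
package tailsamplingprocessor

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/pdata/pcommon"

	"github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling"
)

// errorEvaluator is a stub whose Evaluate always fails, so every trace it
// sees should bump evaluateErrorCount exactly once.
type errorEvaluator struct{}

var _ sampling.PolicyEvaluator = errorEvaluator{}

func (errorEvaluator) Evaluate(context.Context, pcommon.TraceID, *sampling.TraceData) (sampling.Decision, error) {
	return sampling.Unspecified, errors.New("forced evaluation failure")
}
```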

codecov bot commented Jan 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.58%. Comparing base (25912dc) to head (8675fc7).
Report is 10 commits behind head on main.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main   #37212   +/-   ##
=======================================
  Coverage   79.58%   79.58%
=======================================
  Files        2274     2274
  Lines      212996   212997    +1
=======================================
+ Hits       169509   169513    +4
  Misses      37795    37795
+ Partials     5692     5689    -3
```


@github-actions github-actions bot requested a review from portertech January 20, 2025 11:30
@yvrhdn (Contributor, Author) commented Jan 20, 2025

I'm not sure why codecov is failing 😅 I added tests to validate that the metrics are updated correctly, but since I didn't add any new code paths, the coverage percentage won't change.

@jpkrohling (Member) left a comment

This can potentially catch people by surprise: they'll likely see different metric patterns for the same workload. I think this deserves a subtext explaining what can happen.

@yvrhdn (Contributor, Author) commented Jan 24, 2025

> This can potentially catch people by surprise: they'll likely see different metric patterns for the same workload. I think this deserves a subtext explaining what can happen.

Done!

Labels
processor/tailsampling Tail sampling processor