
[processor/tailsampling] Fixed sampling decision metrics #37212

Open · wants to merge 14 commits into main

Conversation

@yvrhdn (Contributor) commented Jan 14, 2025

Description

Fixes some of the metrics emitted from sampling decisions. I believe otelcol_processor_tail_sampling_sampling_trace_dropped_too_early and otelcol_processor_tail_sampling_sampling_policy_evaluation_error_total are sometimes overcounted.

The bug: samplingPolicyOnTick creates a policyMetrics struct to hold several counters, and this struct is shared across all the traces evaluated during that tick.
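
A sketch of that struct (field names inferred from the metric calls below, so the real definition may differ slightly):

```go
// policyMetrics accumulates per-tick counters for the sampling decision loop.
type policyMetrics struct {
	idNotFoundOnMapCount, evaluateErrorCount int64
}
```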

On each iteration of the decision loop, the counter values are added to the metrics:

```go
tsp.telemetry.ProcessorTailSamplingSamplingDecisionTimerLatency.Record(tsp.ctx, int64(time.Since(startTime)/time.Microsecond))
tsp.telemetry.ProcessorTailSamplingSamplingTraceDroppedTooEarly.Add(tsp.ctx, metrics.idNotFoundOnMapCount)
tsp.telemetry.ProcessorTailSamplingSamplingPolicyEvaluationError.Add(tsp.ctx, metrics.evaluateErrorCount)
tsp.telemetry.ProcessorTailSamplingSamplingTracesOnMemory.Record(tsp.ctx, int64(tsp.numTracesOnMap.Load()))
tsp.telemetry.ProcessorTailSamplingGlobalCountTracesSampled.Add(tsp.ctx, 1, decisionToAttribute[decision])
```

But the counters are not reset between iterations, so if the first evaluated trace could not be found on the map, idNotFoundOnMapCount is set to 1 and every subsequent iteration adds another 1 to the otelcol_processor_tail_sampling_sampling_trace_dropped_too_early metric, even though those traces were found.

I've moved the metric updates outside of the for loop so the counters are added only once per tick.
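
To make the overcounting concrete, here is a minimal standalone sketch (not the processor code; idNotFound stands in for metrics.idNotFoundOnMapCount):

```go
package main

import "fmt"

func main() {
	var idNotFound int64 // shared across the whole tick, like policyMetrics
	var recorded int64   // what the metric instrument receives in total

	traces := []bool{false, true, true} // only the first trace is missing from the map
	for _, found := range traces {
		if !found {
			idNotFound++
		}
		// Buggy pattern: adding the running counter on every iteration
		// records 1 + 1 + 1 = 3 drops instead of 1.
		recorded += idNotFound
	}
	fmt.Println("recorded inside the loop:", recorded) // 3

	// Fixed pattern: add the counter once, after the loop.
	fmt.Println("recorded after the loop:", idNotFound) // 1
}
```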

Testing

I have added a dedicated test for each metric, processing multiple traces in one tick.
I've added a test for otelcol_processor_tail_sampling_sampling_trace_dropped_too_early.
I can add one for sampling_policy_evaluation_error too, I'm just not sure how to deliberately fail a policy.

@yvrhdn yvrhdn requested review from jpkrohling and a team as code owners January 14, 2025 13:05
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jan 14, 2025
@portertech (Contributor) commented:

@yvrhdn perhaps an OTTL policy with an invalid condition would work to test sampling_policy_evaluation_error?

@yvrhdn (Contributor, Author) commented Jan 16, 2025

> @yvrhdn perhaps an OTTL policy with an invalid condition would work to test sampling_policy_evaluation_error?

Cool, I've added a test for sampling_policy_evaluation_error as well 🙂
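
For reference, one way to force an evaluation error in a test is a stub evaluator that always fails. This is a hypothetical sketch, assuming the internal sampling.PolicyEvaluator interface shape; the actual test may instead use the invalid OTTL condition suggested above:

```go
package tailsamplingprocessor

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/pdata/pcommon"

	"github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling"
)

// errorEvaluator is a stub whose Evaluate always fails, so every trace it
// sees should bump evaluateErrorCount exactly once.
type errorEvaluator struct{}

var _ sampling.PolicyEvaluator = errorEvaluator{}

func (errorEvaluator) Evaluate(context.Context, pcommon.TraceID, *sampling.TraceData) (sampling.Decision, error) {
	return sampling.Unspecified, errors.New("forced evaluation failure")
}
```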

codecov bot commented Jan 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.58%. Comparing base (25912dc) to head (8675fc7).
Report is 10 commits behind head on main.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main   #37212   +/-   ##
=======================================
  Coverage   79.58%   79.58%
=======================================
  Files        2274     2274
  Lines      212996   212997    +1
=======================================
+ Hits       169509   169513    +4
  Misses      37795    37795
+ Partials     5692     5689    -3
```


@github-actions github-actions bot requested a review from portertech January 20, 2025 11:30
@yvrhdn (Contributor, Author) commented Jan 20, 2025

I'm not sure why codecov is failing 😅 I added tests to validate that the metrics are updated correctly, but since I didn't add any new code paths, the coverage percentage won't change.

@jpkrohling (Member) left a comment

This can potentially catch people by surprise: they'll likely see different metric patterns for the same workload. I think this deserves a subtext explaining what can happen.

@yvrhdn (Contributor, Author) commented Jan 24, 2025

> This can potentially catch people by surprise: they'll likely see different metric patterns for the same workload. I think this deserves a subtext explaining what can happen.

Done!

Labels
processor/tailsampling Tail sampling processor