Generated service-graph metrics for messaging_system connection type #3210
Yes, I agree this is definitely a use case we need to improve on. With synchronous calls, network overhead can be estimated as parent duration minus child duration, but we don't have a good way to show the gap between the parent span's end and the child span's start, which is what async/queuing systems need. The fundamental challenge is that for these spans to be paired up, we would need long waits in the metrics generator. The cost isn't that high, since we store very little data when calculating service graphs, but it's a non-obvious operational issue. I would support adjusting the way we do service graph metrics for producer/consumer for the reason you highlighted above. Honestly, the metrics generator as a whole needs some improvements. We could reduce series by dropping the histograms not needed for the service graph, and there have been a number of small feature requests for it. It's an area that needs some attention.
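To illustrate the two measurements mentioned above, here is a minimal sketch in Python. The `Span` class and both function names are hypothetical and exist only for this example; they are not part of the metrics generator, and the timestamps are made up.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span with start/end timestamps in milliseconds."""
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def sync_network_overhead_ms(parent: Span, child: Span) -> float:
    """Synchronous call: the child runs inside the parent, so the
    overhead estimate is parent duration minus child duration."""
    return parent.duration_ms - child.duration_ms


def async_queue_gap_ms(producer: Span, consumer: Span) -> float:
    """Async/queued call: the consumer starts after the producer ends,
    so the interesting value is the gap between the two spans."""
    return consumer.start_ms - producer.end_ms


# Synchronous: client span 0..50 ms wraps server span 10..40 ms.
client = Span(start_ms=0, end_ms=50)
server = Span(start_ms=10, end_ms=40)
sync_overhead = sync_network_overhead_ms(client, server)

# Asynchronous: producer finishes at 5 ms, consumer starts at 1005 ms.
producer = Span(start_ms=0, end_ms=5)
consumer = Span(start_ms=1005, end_ms=1030)
queue_gap = async_queue_gap_ms(producer, consumer)

print(sync_overhead, queue_gap)
```

Note that applying the synchronous formula to the producer/consumer pair would give a negative number (5 ms minus 25 ms), which is one way to see why the async case needs its own measurement.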
I was wondering about the correctness of the way the service_graph metrics generated by the Metrics Generator are calculated for server side metrics.
I have no doubts about the "classic" client-server interaction, but for producer-consumer interactions (`connection_type=messaging_system`), I believe the current behavior isn't right (or it should at least be extended to cover another use case). The case I'm describing is when the latency that needs to be measured is the total latency, from the start time of the producer until the end time of the consumer.
Imagine a situation where a message is published to some queue A (the publish span takes 10ms), then the message sits in the queue for 20s (for example, because no consumer was available to process another message), and is finally pulled and processed in 15ms by the consumer.
The service-graph metrics will only "count" the 10ms for the producer side and the 15ms for the consumer side (the consumer calculates its latency from the consumer span duration),
leaving the 20s delay unaccounted for on either side and in any metric...
I believe this 20s latency should be considered somehow, what do you guys think? 🙃
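To make the numbers concrete, here is a minimal Python sketch of the timeline described above. The variable names are illustrative only and are not taken from the metrics generator; the "reported" values simply follow the behavior described in this post.

```python
# Durations from the scenario above, in milliseconds.
PUBLISH_MS = 10         # producer (publish) span
QUEUE_WAIT_MS = 20_000  # time the message sits in queue A
PROCESS_MS = 15         # consumer span

# Build the timeline, starting the producer at t = 0.
producer_start = 0
producer_end = producer_start + PUBLISH_MS
consumer_start = producer_end + QUEUE_WAIT_MS
consumer_end = consumer_start + PROCESS_MS

# What gets counted per the description above: each side's own span duration.
reported_producer_ms = producer_end - producer_start
reported_consumer_ms = consumer_end - consumer_start

# What this post argues should also be visible: producer start to consumer end.
end_to_end_ms = consumer_end - producer_start
unaccounted_ms = end_to_end_ms - reported_producer_ms - reported_consumer_ms

print(f"reported: producer={reported_producer_ms}ms consumer={reported_consumer_ms}ms")
print(f"end-to-end={end_to_end_ms}ms unaccounted={unaccounted_ms}ms")
```

The two reported durations add up to 25ms, while the end-to-end latency is 20025ms, so the whole 20s queue wait is missing from the metrics.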