Messaging: per-message tracing when sending batches #1187
Comments
Have you thought about how this would be represented in the protocol? Would you then send a list of links along with the list of spans? The links could also be interpreted as "data-less spans" that have no timestamps (maybe a creation time?), and no attributes besides trace ID, span ID, parent span ID. CC @discostu105: I think at Dynatrace we have been using a concept very similar to these spanless contexts ("links") but IIUC we are in the process of getting rid of them. |
The same way as today (in both cases) - using span.links that already have all the properties we need (linked context and attributes). Or did I misunderstand your question? |
So you need attributes? Then it's not a pure "context". |
@lmolkova please correct me if I got it wrong, but I think the idea is that we have a "context" (no attributes/no real span) that's inside the message. Then on |
FYI from spec meeting this morning re an Events API support on Logs open-telemetry/opentelemetry-specification#2676 |
One idea that I wanted to bring up is to use zero-duration-spans after all and use a special span kind like "DOWNLINK". That way you could hide them / collapse them into the parent in an UI. Of course, this would need special support by the backend, but so would anything else we discuss here. |
@Oberon00 agree that if we follow 0-span duration path, backends need some heuristics that tell it's a span created for message. I believe we can do it with PRODUCER kind (and publish span, in this case, is just CLIENT). Still, such a span should have 0-duration, no status, no attributes, no links or events and it raises the question if it should be a span or something else. With a pure context, we keep the door open to adjust to real-world feedback. We can add an event or a span later in messaging v1.X in a non-breaking manner. Picking event with context semantics or span will be a final decision. |
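To make the heuristic mentioned above concrete, here is a rough sketch of how a backend might recognize such a per-message marker span. Everything in it is an assumption for illustration: `SpanRecord` and `looks_like_message_marker` are hypothetical names, not part of any OTel SDK or backend, and the rule simply encodes the properties listed in the comment (PRODUCER kind, zero duration, no attributes, links, or events).

```python
from dataclasses import dataclass, field


@dataclass
class SpanRecord:
    """Hypothetical, simplified view of an exported span."""
    kind: str
    start_ns: int
    end_ns: int
    attributes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)
    events: list = field(default_factory=list)


def looks_like_message_marker(span: SpanRecord) -> bool:
    """True if the span carries nothing but a context for a message."""
    return (
        span.kind == "PRODUCER"
        and span.end_ns == span.start_ns  # zero duration
        and not span.attributes
        and not span.links
        and not span.events
    )
```

A UI could use a check like this to collapse marker spans into their parent, which is the same special handling a dedicated span kind would need.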
I thought attributes were actually needed on these? |
Attributes should be on links to message context, not on spans. Reasons:
|
I don't understand. If we used zero-duration spans, of course it would make sense to put the attributes on the spans, and have the publish span as parent, and not using zero-duration spans and links at the sender side at the same time. Of course the message that is sent to the broker would only contain a pure context, completely independent of how this is implemented in OTel. |
Imagine I create a message on service A and publish it to Kafka topic. Service B receives it and forwards it to service C via another Kafka cluster/topic. It's quite a common scenario and there are many tools that do it. Or imagine I'm a user and keep source context in which blob/DB record was created in record metadata. I want to use this context as my message context and stamp it on the message manually. Auto-instrumentation that publishes this message cannot create a span and override message context. Where would you put attributes if there is no span? Asking user to create this span is not a great experience. The answer to both cases - put them on links. |
I might have completely misunderstood the design proposed here. I have a hunch what you might mean now, but I'm still not sure. So let me ask this: Why can't service B create a messaging span? Since service B is not an intermediary, but a service, it ought to create a span with the incoming message as parent (or with a link to it) and modify the span context on the message with the span context of the create/publish span of the new message publication. Am I wrong here? If service B cannot modify the context on the message, it will be impossible to tell from the trace structure if anything you link to the context on the message happened in service A or service B, and in which causal/happens-before relationship. |
Service B can create a span, but then it has to modify message context as well. Now let's assume ServiceB is a broker or, in a more popular case, an extra app layer that does geo-replication. While it could create a processing span, then create a new span for the message and modify the context, it'll be inefficient and verbose for the case of simple forwarding/routing/sharding. Moreover, assuming ServiceB is a broker, its telemetry could belong to the cloud provider it's managed by. Creating such spans would break causality. So the rule of thumb we came up with: if messaging library/system got a message with context (forwarded from somewhere else or set by user) - it must not create a span for message (or a new context). This context should be de-facto immutable. Now causality without message span is achieved through links.
In either case, we still have publish span on every hop that is linked to this context. You can follow along and see message received on ServiceB and republished there via links. |
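A minimal sketch of the rule of thumb described above, under simplifying assumptions: contexts are plain strings, and `Message`, `PublishSpan`, and `publish` are hypothetical names rather than OTel APIs. The point is only that a context already present on the message is never overwritten, while every hop still records a publish span that links to it.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    headers: dict = field(default_factory=dict)


@dataclass
class PublishSpan:
    service: str
    links: list = field(default_factory=list)


def publish(service, message, fresh_context):
    """Publish one message, stamping a context only if none is present."""
    span = PublishSpan(service)
    # An existing context (forwarded or set by the user) is left untouched.
    message.headers.setdefault("traceparent", fresh_context)
    # The publish span on this hop always links to the message context.
    span.links.append(message.headers["traceparent"])
    return span
```

Publishing the same message first from service A and then from service B leaves the context created on A in place, and both publish spans link to it.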
This multi-tenant/multi-vendor problem can & should be solved with per-tenant/per-vendor tracestate entries. I think we should keep that discussion separate. open-telemetry/opentelemetry-specification#366 (comment)
To clarify: Of course the library/system would create a (publish) span, but it should not inject that span's context into the message. Is that what you mean? I think this is a general new propagation design that you propose here, and I don't see how this is specific to messages. You could apply the same strategy to HTTP requests, which may also pass multiple hops (e.g. consider AWS Lambda which you usually invoke via a service called API gateway proxy, or Google Cloud Functions, which are behind a load balancer that actually participates in the W3C trace, trashing your span IDs, see open-telemetry/opentelemetry-specification#1852 (comment)) |
Sure, but let's make sure we keep the routing/replication/forwarding/sharing discussion on. Service-meshes would be first to hit the problem here.
Correct, in batch send, publish span context cannot be put on messages - if it does, they would not be individually traceable. Auto-instrumentation should allow applications to associate a custom context with the message.
It's specific to messages since:
The key difference here is that for HTTP the request content is tightly coupled to the transport call and a new call requires a new message; for messaging that's not the case. Assuming everything would have a span, forwarding A->B->C scenario would look like this:
(would you like that for every service mesh instrumentation?) Without context modification:
Both of these options carry the same information, but the first one is much more verbose. So what's the benefit? |
The same could be said about HTTP: The ultimate handler of the HTTP request contains application logic while any (reverse) proxies in-between are less relevant.
They do not. In the first scenario, you have the relationship A -> B -> C, and in the second one you only have A -> B and A -> C, i.e. a direct connection from A to both B and C, and only an indirect and undirected connection between B and C over the common parent A. You have no idea whether B forwarded to C, C forwarded to B, or A sent to B and C simultaneously (though the latter would be the most direct interpretation of the trace structure). That's what I meant by the loss of causal/happens-before relationships. I want to bring up that we seem to discuss two mostly orthogonal topics in this issue:
|
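The causality argument above can be illustrated with a small sketch. It is a simplification with hypothetical function names: each hop is reduced to a service name, and a relationship (parent or link) is reduced to a `(from, to)` pair. It does not reproduce the diagrams from the earlier comment.

```python
def edges_when_each_hop_rewrites(hops):
    """Each hop injects its own context, so every hop points at the previous one."""
    return [(prev, cur) for prev, cur in zip(hops, hops[1:])]


def edges_when_context_is_immutable(hops):
    """The context from the first hop is kept, so every later hop points at it."""
    origin = hops[0]
    return [(origin, cur) for cur in hops[1:]]
```

For hops `["A", "B", "C"]` the first function yields A->B and B->C, the second A->B and A->C. With the immutable context, `["A", "B", "C"]` and `["A", "C", "B"]` produce the same set of edges, so the order of B and C can only be recovered from timestamps.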
Perfect observation. So brokers and forwarders are like HTTP proxies and load balancers. They probably don't emit any traces, and when they do, they probably should not change HTTP headers, otherwise traces become too verbose.
It's a fair point. At the same time, the moment you introduce batching, you lose causality because links don't provide it. A: message s1 By looking at this the only way to tell that A called B is by timestamps.
Agreed, but they are related to some extent. To your second point - there are no siblings - they are all independent traces related via links. |
OK, so you are saying, in your first scenario, B not only does not modify the context, it also does not emit any telemetry items at all? If that's the case, I misunderstood that.
But if there is a trace, there has to be a span. So now I'm a bit confused what is actually meant here. |
I don't think this is a precise statement. There is a span, but it's a transport span that sends (a batch) to broker or receives a batch from broker. When we receive a batch, we can't always create span per message in auto-inst, it's app responsibility to create it if they want. We can only guarantee a receive span that links to each context in a batch. I.e. messages belong to application, application properties on the message are immutable for brokers and infra, they must not be modified. I.e. message trace context cannot be modified and no span must be created to re-trace this instance of message. |
To share another use case related to this discussion, we have a service that produces Kafka messages in a transaction as a large batch to a single topic. But that batch could be thousands of messages, so adding a link per message to a single span is not feasible, as links are normally limited to 128 if I understand correctly. Similarly, the array tag approach would result in a very large tag value. I wonder if the conventions could also provide semantics for a single span representing a batch of produced messages at a cost of losing granularity in the trace. Or maybe that's already addressed somewhere and I missed it. |
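As a quick back-of-the-envelope check of the concern above: assuming the span link limit is 128 (it is configurable, so treat this value as an assumption about the setup) and that links beyond the limit are dropped, a hypothetical helper shows how many per-message links a single span for a large batch would lose.

```python
DEFAULT_LINK_LIMIT = 128  # assumed limit; configurable in practice


def links_kept_and_dropped(batch_size, link_limit=DEFAULT_LINK_LIMIT):
    """Return (kept, dropped) link counts for one span linking a whole batch."""
    kept = min(batch_size, link_limit)
    return kept, batch_size - kept
```

For a batch of 5000 messages this gives 128 links kept and 4872 dropped, which is the loss of granularity the comment describes.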
yes, closing this one. PS: I still like spanless contexts more 🙃 |
Reopening based on the feedback from @tedsuo to discuss zero-duration spans. Will bring it up on messaging SIG 6/27 |
capturing some feedback points:
|
Discussed at messaging SIG:
We should look for more options:
Will bring it up on spec meeting. |
I'm probably missing something here, how do you find the parent of the "new message" context? |
@lmolkova given we have the conventions now mentioning the create context and IIRC "zero duration" spans are not a big deal, do we have anything left to do in this issue? It seems to me all is "resolved" now? Or am I missing something? |
yeah, I think we can close it - we have #1273 to track remaining work (making per-message tracing disableable). Thanks! |
In Messaging Instrumentation WG, we're looking for the proper way to trace multiple messages sent within a single batch.
We do not have a concept for this in tracing spec and want to hear opinions on the options we came up with.
E.g. a user sends a batch of messages like producer.send([msg1, msg2]), then this batch is reshuffled on the broker and then each message is sent to consumer(s) as a part of another batch.
In this case, users should still be able to trace individual messages through the system. To achieve it, we need a unique context per message that's propagated from producer to consumer.
Options:
send span has links to each message span.
send span duration. It can potentially measure when each message is sent (but many systems get ack on batch, not per message)
send span has links to each message span context.
More context: https://docs.google.com/document/d/1OrHsepd6GjzXKll1ggZyx1jBQd0d_t8NZXT1ZOem7D0/edit#heading=h.hfmrnf56kiuf
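A rough sketch of the common idea behind these options, a unique context per message with the batch `send` span linking to each of them. It uses only the standard library; `SpanContext`, `Message`, `Span`, `new_context`, and `send_batch` are hypothetical names for illustration and not the OTel API. The `traceparent` header layout follows W3C Trace Context (version, trace ID, span ID, flags).

```python
import secrets
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str


@dataclass
class Message:
    body: str
    headers: dict = field(default_factory=dict)


@dataclass
class Span:
    name: str
    context: SpanContext
    links: list = field(default_factory=list)


def new_context() -> SpanContext:
    # 16-byte trace ID and 8-byte span ID, hex encoded.
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def send_batch(messages):
    """Give every message its own context, then link the batch span to each."""
    send_span = Span("send", new_context())
    for msg in messages:
        ctx = new_context()
        # traceparent layout: version-traceid-spanid-flags
        msg.headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"
        send_span.links.append(ctx)
    return send_span
```

Calling `send_batch([Message("msg1"), Message("msg2")])` returns one `send` span with two links, and each message carries a different `traceparent`, so it stays individually traceable even if the broker regroups the batch.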