-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow INTERNAL
GenAI/db spans instead of requiring the kind to be CLIENT
#1315
Comments
I do agree that INTERNAL makes sense too. It's not a GenAI-specific problem - the same problem exists in messaging, database, or any other conventions for logical calls that happen on top of instrumented transport. But you never know if this transport is instrumented. Application may choose to disable (not enable) it, or disable it in scope of a specific library. E.g. AWS client instrumentations usually allow to suppress nested HTTP spans. Arguably client-side instrumentation never knows if the transport is instrumented, therefore cannot set kind conditionally. Long story short, I think the client/internal span distinction is not useful as it exists today. All other logical client semconv (messaing/db, grpc) also require client kind for the above reasons. |
Because spring instrumentation began with zipkin semantics, it can be surprising to find things that lie over an http transport, and have no ability to set headers, marked as client anyway. Partially replicating transport properties such as server address is even more surprising. I had the same feedback. In zipkin, not otel, instrumentation "client" indicates "exit span" outbound trace propagation except where there is none possible (db clients). The most common client were http and grpc, both of which could set headers, as well any other rpc'esque thing like dubbo. Sure double-instrumentation could happen, but in zipkin we would never mark something client that is most often a layer over http, unless it could set headers. So, absolutely, this is different. What I learned from folks including @lmolkova is that client is vague. There is no coupling with "remote" or "exit span" or ability to set headers. It is a software category and says nothing about transport. So, I agree that client/internal span is a coin toss. It bothers me we set this to client, but not as much in context of how semantics are defined. |
Related: I was extremely surprised to find all of the instrumentation sdks skipping the http client under openai or all the other REST clients that implement this. I tried half a dozen: not one added http instrumentation to openai calls. I would expect by end of the year latest, many people will notice this missing, and ask to fix that, especially as failures at the transport layer are important to know. So, more people will notice this as a result. More people seeing this in practice may change the coin toss, but it may also result in the same. We'll see! |
Thank you both for your answers. They helped me understand better the topic. I was indeed thinking of client spans in OpenTelemetry the same way they are used in OpenZipkin: an "exit span" towards a remote service. I might have overlooked a bit the OTel definition. I guess the key part is "usually", so it's not always an "exit span" towards a remote service. Did I understand correctly?
Still, it seems a bit odd to me getting a series of nested spans with kind With respect to Java/Spring, the transport layer comes with the core framework and it's always instrumented (unless the user decides to turn that off). For clarity and consistency with the rest of the Java ecosystem, I would still use @brunobat what's your take from Quarkus?
I noticed that too. And that might become a problem soon, because some backend solutions/platforms that are embracing the GenAI Semantic Conventions seem to have some implicit assumption that an LLM-powered application would only send GenAI spans and metrics. When using those services from a production application instrumented at different levels, some of those solutions break because they are not expecting other types of spans/metrics other than the GenAI ones. Something similar might happen with vector databases. I haven't look that deep into that part of observability yet, but it seems many libraries are not considering existing instrumentation of underlying DB transport libraries. |
@ThomasVitale PS I want to be very clear, even though I'm not asked and also have no karma on this repo. I understand the argument that you CAN make the llm span kind=client, but 100pct think that's the wrong optimization. The least bad choice is INTERNAL, and doing so encourages people to instrument the http calls underneath and prevent preventable problems such as broken tracing or lack of awareness of http failures. Basically, it is more likely there is an actual instrumentable client under the llm span than not, and if that isn't there bad things will happen sooner or later. It is better to optimize for the usually case than the usually not in decisions. Even though I yield to things I don't agree with or don't prefer, this is not the right choice and should be corrected. |
History TL;DR; kind had been undefined until april this year. It isn't completely converged either, even in new implementations. However, most current python sdks set kind=client. As far as my research shows, @drewby made the suggestion to set this to client and @nirga accepted that feedback. I see no evidence of a long discussion about the nuance of this choice, but maybe it happed in slack or on a call. So, the first openai instrumentation was from @cartermp over a year ago, and this commit "Add embeddings and make it work with autoinstrumentation" switched to kind=client. In Sep last year, carter raised an issue to make define llm semantics. His proposal didn't actually indicate a span kind preference at the llm level, so it would have been internal by default. He forked the semantics repo and worked on upstream definition, which became a PR in early November. This also didn't indicate a span kind preference. @nirga took over the effort after above fizzled out into a new PR this march. Initially it also didn't indicate a span kind Before this PR was merged, @drewby requested a must on kind=client, and Nir accepted that. So basically, this is the history. The primary language community has been python, and so the specs have been primarily driven or followed by python impls. I have looked at many python sdks now with at least two developers: AgiFlow, langrace, litellm, openinference, openlit, and openllmetry. Most have converged to kind=client, either intentionally or due to "born after date" as it is still a new ecosystem. A datapoint on this of interest was that there are still a couple python SDKs which have kind=internal. I highlighted to each that the spec asks for this to be kind=client despite not personally agreeing with it. openinference cc @mikeldking https://github.com/Arize-ai/openinference/blob/22a93f9d8445ddb7269e23b5b10fcd007cc2a0d2/js/packages/openinference-instrumentation-openai/src/instrumentation.ts#L133 Next steps: If we agree that the choice was made without the context provided here, I think it is fair to revisit this decision. We have changed other properties in many incompatible ways. My suggestion is to get feedback from @drewby about the "must" part, and weigh in based on new information that exists today. The approver+ community of this spec is currently @lmolkova @drewby @cartermp and @nirga and a change would require buy-in from them, informed by implementation opinions as well passer bys. Hope this helps! |
We discussed this in one of the early Working Group meetings. The primary topic was on how to name the spans, and within that conversation, we considered other semantics for database, HTTP, etc., and decided on CLIENT as the span type. I want to ensure you that this was discussed, albeit briefly, though it may not be captured in the review/commit history. One argument for keeping this as CLIENT is the differentiation it provides in orchestrating tools like LangChain. Specifically, it helps distinguish between internal steps (such as combining context and prompt) and external steps calling the LLM. Referencing the SpanKind documentation:
In the case of instrumenting orchestrators, there is a logical span that is calling out to a server which is different than an internal span doing work locally. Given that our working group membership is expanding, and our conventions are marked experimental, revisiting this decision is possible. We could maintain the CLIENT designation and improve the documentation to provide more clarity. Alternatively, we could re-evaluate the "MUST" requirement for CLIENT in light of new information and feedback, or remove the CLIENT requirement and default to INTERNAL, aligning with the initial design. |
I want to re-iterate a few things:
So giving instrumentations a flexibility would mean that they would populate it inconsistently depending on author familiarity with different use-cases and configuration options. Since other OTel semantic conventions REQUIRE CLIENT, GenAI should not be different and should REQUIRE CLIENT too. When I'm saying that INTERNAL makes sense, I mean conceptually - the INTERNAL is described so vaguely, you can call anything an internal span. |
@lmolkova do you mind being specific when you say "all other similar conventions" maybe edit your comment? I would expect "similar" to be something like rest layers, such as repository interfaces, control APIs like k8s, etc. being specific will allow us to be on the same page on what we feel things like OpenAI are similar to. fyi I found this from @mikeldking Arize-ai/openinference#609 to change openinference to client |
trying to help add more rigor here, since I understand one argument is what this spec is closer to. one thing is clear that when something is a subtype of RPC it would be client for this side of work. So, for example, aws-sdk is a subtype of that so implied kind=client. aws-sdk is a control api riding over a rest interface, so that's common. However, we don't currently (to my knowledge) define llm/genai in terms of rpc. So, if this rpc implied maybe we can make that more explicit in the spec, such as done in aws-sdk Something a little less precise is the graphql semantic which is also a layer over http. It says the word "server" in the title, but doesn't discuss RPC or span kind in the doc, yet. Personally, I don't think any of the existing semver are by nature like genai (e.g. openai in impl), but I understand there are strong feelings to the opposite. |
@codefromthecrypt Databases and Messaging come to mind. Both will either require CLIENT or CONSUMER/PRODUCER on top of another layer that could potentially be HTTP rest API. These define logical requests that have meaning (ie. descriptive attributes associated with a thing that is a request, but not something you'd add to http span). |
@drewby real question.. how is a prompt/completion request similar to a database or messaging? |
both database and messaging are logical operations that run on top of some protocol (HTTP, gRPC, something not instrumented on the low level like SQL), In all cases, we create 1 logical span for outer operation and as many protocol-level spans as there were protocol calls. E.g. if there are 2 retries, there will be parent LLM/DB/Messaging span - OK
or if there was authentication call, the would be something like parent LLM/DB/Messaging span - OK
I.e. LLM instrumentation applies on the public API of the corresponding client library and above any retry/redirect/auth handlers in HTTP (or other protocol) pipelines. |
ok I see where you are coming from even though I personally believe database and messaging are far wider abstractions than this. Appreciate you explaining what you feel is similar 👍 |
Since native/auto HTTP/gRPC instrumentations are already there, the LLM has no other choice than to follow the same pattern:
I.e. it does not know whether there is a client span or there is no client span under it. So it has no better choice than to be client itself. In this sense there is no difference between LLM, S3, or Elasticsearch. |
Thanks everyone for all the comments. After reading all your contributions, it looks to me that the description of Span Kind in the OpenTelemetry Specification is lacking something, and I'm glad to see @lmolkova starting a change request there. Did I understand correctly that all spans that eventually might lead to a remote service call should be of kind I'm also wondering to what extent we need to make up for missing instrumentations at the transport level when instrumenting other layers (e.g. LLM) that build on top of those. Are there guidelines about which attributes to "duplicate" across different levels when defining conventions (e.g. Apologies if they're silly questions, I'm trying to understand better how to assign correctly a kind to a given span with OpenTelemetry, with the right attributes. I appreciate all your help with clarifying the subject. |
Yes.
Yes to some extent. Since multiple server layers are (at least sometimes) coming from the same framework instrumentation, it could be possible to coordinate who creates SERVER within that framework. Still, in Java multiple nested SERVERs could easily happen. OTel java supports suppressing nested spans - https://opentelemetry.io/docs/zero-code/java/agent/disable/#instrumentation-span-suppression-behavior I believe the default is that 'same-type-of-semconv' spans are suppressed. E.g. HTTP clients could use each other and create duplicate spans. Java Instrumentation API in Otel suppresses nested HTTP client spans if there is already HTTP client span on the context. It supports other forms of suppression - e.g. by kind. I.e. in the scope of LLM CLIENT, HTTP CLIENT will be suppressed. Some users prefer to have only logical instrumentations, others want only physical ones or both of them - we want them to be able to pick the layers that they want.
I think we don't have an answer to this in the community - open-telemetry/opentelemetry-specification#4179 INTERNAL spans are de-facto used for things like background in-process jobs processing, web framework controller spans - i.e. those that neither server nor client.
We see in practice that:
As a result, LLM instrumentation cannot rely on HTTP one being there. It should be self-sufficient to some extent which leads to some duplication as you have noticed in spring-projects/spring-ai#1174. This duplication is essential (but also cheap) on metrics for the reasons I described in that issue.
|
Having said this, the only reason to REQUIRE CLIENT kind (vs RECOMMEND it) is to achieve consistency and it's not a super-strong reason. Are there any case where:
I.e. I'm not really opposed to relaxing MUST to SHOULD (for all conventions), but instrumentations should have some very good reasons not to use CLIENT. The reasons I heard so far seem subjective to me. |
One problem I can say is that currently no python probably also no javascript "sdk" I've seen is instrumenting http requests. It may be that they think this is a client and so they don't need to. There are some issues already raised about broken traces. When we water down what client means we add to confusion. We need to rebalance with advice about actually doing the client instrumentation. I think the concept that all things that could result in client should also be client is extremely fragile, and confusing. Easiest way to break it is have a parent result in a client and a producer span. Things like dependency link aggregation manifest as clients and while they can walk around reality bugs such as duplicate instrumentation (eg real duplicate like 2 http instrumentation, not one parent is a rest repository and the other an actual client). Basically, before making such decisions a survey of dependency graph aggregation impact is valid, even if they can work around accidental double-clients today. We are considering changing accident to routine. I feel like this topic has grown its own legs. It is effectively turning all spans that were internal to client, except pure functions. I believe a very wide audience should be considering this carefully, with reality in mind both instrumentation and backend notably aggregation of topology and metrics. That there's a loophole in the spec that allows one to get to all things client, doesn't mean we should. I think more people on this topic actively, those who have impact when it changes or themselves write backends, need to speak up loudly. |
I feel I'm not heard. It's the responsibility of https://github.com/open-telemetry/opentelemetry-js/tree/main/experimental/packages/opentelemetry-instrumentation-http, https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-httpx and similar. The CLIENT kind in semantic conventions is not a loophole, but a design decision. There is a general consensus that this is the inevitable direction. UPDATE: also there are quite a few existing instrumentations that generate nested client spans. Yes, it creates issues for application maps - #674 (and we have not heard from anyone but Azure Monitor that it's a problem 🤷♀️), but what I'm saying that it's technically impossible to know whether you're the last one and should be a client. So backends need to find another way to visualize maps. |
When you write a document, you hope that implementations are coherent in the intent of the doc. I have been involved in instrumentation since before CNCF existed, tens years back when everything that wasn't a server was a client. I am aware that http instrumentation is a different responsibility that LLM layer. I do understand deeply what you are saying and am disappointed you feel unheard. There is a difference between spec and reality. For example, when I entered this SIG I found nothing was compatible, and no one tracking that. The reason I can tell you that currently there are no entrypoint SDKs configuring that http instrumentation is because literally I have tried every one. Some add http attributes to the LLM spans, no one is adding instrumenting of the http spans. Please lets not jump to the conclusion that I expect the LLM span to be also HTTP. I am highlighting that despite writing a spec folks aren't doing this. On the topic of leaving things as client, we should acknowledge the reality and clarify the spec. When we write docs we hope it leads somewhere.. currently there isn't evidence most folks are yet led to instrument http except @ThomasVitale who noticed! In summary, I made a documentation clarification requests. I don't want to revisit the internal vs client part anymore, or any other derivation of it in this comment even if you ask me. I would like you to consider that folks are mostly skipping this now, and if we shouldn't add a sentence to improve that. Otherwise, I'll leave this to other people. |
From what I see, OpenLit, Traceloop, Langtrace are not supposed to enable HTTP - they run inside user application and it's the end-user choice to instrument and configure HTTP layer. I feel it's important to clarify it. Happy to learn if you meant something else. |
Apologies for the late comment to this issue. At AGIFlow, we initially supported automatic HTTP instrumentation but decided to remove it. We found that users running Azure Durable Functions experienced excessive tracing noise due to the default networking instrumentation, due to implementation details of Azure Durable Functions. For us, GenAI instrumentation goes beyond observability; it also acts as a data collector for compliance, model improvement, and other purposes. While HTTP instrumentation is valuable, it is not the sole focus of our approach. We understand that some users may prefer to use other backends for observability. In such cases, we provide optional networking instrumentation when initializing the LLM instrumentation library. IMHO, the scope of GenAI instrumentation is a bit wider than only for observability; as platforms like OpenLit, Traceloop, and Langtrace also support evaluation and additional functionalities alongside observability and metrics. |
One last thing I'd like to add to the conversation is about those use cases where LLM operations are not actually relying on remote calls. Instead, the model inference service can run in process. For example, Python apps using the Hugging Face Optimum library or Java apps using the ONNX Runtime library to enable running models in the same process. Would those operations running fully in-process be "client" or "internal"? |
good question. Do you think all other semantics would apply? Would it be reasonable to still report I assume So I'd say that the answer is that INTERNAL would be the most applicable option in this case. And I do think (#1315 (comment)) we can relax MUST to SHOULD. In-process model is quite a good reason to violate SHOULD. But I think we should relax it across all relevant semconv. I added this issue to messaging and db projects and we'll discuss it there as well. |
INTERNAL
GenAI spans instead of requiring the kind to be CLIENT
INTERNAL
GenAI/messaging/db spans instead of requiring the kind to be CLIENT
I think the GenAI spans should be INTERNAL because they are produced by a lib. Yes, it communicates, but I imagine it will use a client lib of some sort to send/receive data... Only communication clients should use CLIENT. |
could you please elaborate:
|
Not sure it would be a problem, just a miss labeling.
If the instrumentation is disabled, it would behave like any other system, the CLIENT span would not show up. It is possible to have SERVER span without a CLIENT one.
Please define disabled. How would it be disabled?
We have that on Quarkus, particular spans will not be generated and in that case, propagation will break because the CLIENT span code is not executed... This is not a sample out, but a non creation. |
We probably need examples to decide. If it does application level logic deciding what to send, it doesn't look like a simple CLIENT, therefore, INTERNAL seems a better fit. |
OTel now supports disabling instrumentation via configuration by tracer name.
How would you decide for grpc - in some cases it runs on top of HTTP (.NET) and both layers can be instrumented. I think I'm trying to make a point that CLIENT/INTERNAL distinction is very vague to the point it's not even helpful. |
INTERNAL
GenAI/messaging/db spans instead of requiring the kind to be CLIENT
INTERNAL
GenAI/db spans instead of requiring the kind to be CLIENT
Fixes #3172 (Built on top of #4088) ## Changes - Explains kinds without assuming presence of parent/children - Adds links as another correlation mechanism - Normalizes nested client spans even further - database, messaging, RPC, and LLM semantic conventions require CLIENT kind for logical client operation. - Does not touch INTERNAL kind yet - #4179 * [x] Related issues #3172, open-telemetry/semantic-conventions#674, open-telemetry/oteps#172, open-telemetry/semantic-conventions#1315 * ~~[ ] Related [OTEP(s)](https://github.com/open-telemetry/oteps) #~~ * ~~[ ] Links to the prototypes (when adding or changing features)~~ * [x] [`CHANGELOG.md`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/CHANGELOG.md) file updated for non-trivial changes * ~~[ ] [`spec-compliance-matrix.md`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md) updated if necessary~~ --------- Co-authored-by: Tigran Najaryan <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]> Co-authored-by: Trask Stalnaker <[email protected]>
Fixes open-telemetry#3172 (Built on top of open-telemetry#4088) ## Changes - Explains kinds without assuming presence of parent/children - Adds links as another correlation mechanism - Normalizes nested client spans even further - database, messaging, RPC, and LLM semantic conventions require CLIENT kind for logical client operation. - Does not touch INTERNAL kind yet - open-telemetry#4179 * [x] Related issues open-telemetry#3172, open-telemetry/semantic-conventions#674, open-telemetry/oteps#172, open-telemetry/semantic-conventions#1315 * ~~[ ] Related [OTEP(s)](https://github.com/open-telemetry/oteps) #~~ * ~~[ ] Links to the prototypes (when adding or changing features)~~ * [x] [`CHANGELOG.md`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/CHANGELOG.md) file updated for non-trivial changes * ~~[ ] [`spec-compliance-matrix.md`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md) updated if necessary~~ --------- Co-authored-by: Tigran Najaryan <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]> Co-authored-by: Trask Stalnaker <[email protected]>
Area(s)
area:gen-ai
Is your change request related to a problem? Please describe.
The current conventions for GenAI spans require the span kind to be
CLIENT
for GenAI applications interacting with model providers.Interactions with model providers rely on HTTP APIs, for which Semantic Conventions exist already. Production applications typically have already an established HTTP infrastructure in their architecture, which often include also instrumentation and observability already. Client libraries to interact with GenAI model providers can build on top of the existing HTTP infrastructure. For example, an OpenAI Client library would delegate to an existing (and already used in production) HTTP library to perform the actual HTTP calls. When that happens, the HTTP calls are already instrumented and observed.
When instrumenting such a GenAI Client, the result would be a span for the GenAI client and a nested/child span for the HTTP interaction (GenAI => HTTP). The GenAI part is a wrapper around HTTP, and it's probably worth maintaining that distinction instead of mixing up the two different concerns. However, in that case, the child span (HTTP) would be of kind
CLIENT
. And in order to follow the Semantic Conventions, the parent span (GenAI) would also have to be of kindCLIENT
, but that's confusing and possibly not one of the "complex scenarios" mentioned in the OpenTelemetry Traces Spec:Even just looking at the definition for the kind
CLIENT
, it seems wrong having the GenAI span asCLIENT
, even though it's the child HTTP span that ultimately describes "a request to some remote service".Describe the solution you'd like
If I understood correctly and it would not be correct to have two nested spans (GenAI => HTTP) both with kind
CLIENT
, then my suggestion would be to change the requirement for GenAI spans to allow bothINTERNAL
andCLIENT
kinds, letting the implementers decide on which one to use based on the library they are instrumenting.If the library doesn't use an already-instrumented HTTP library, then the kind would be
CLIENT
. Example: OpenAI Python library.If the library relies on an already-instrumented HTTP library, then the kind would be
INTERNAL
. Example: OpenAI Java library in Spring Boot and Quarkus.Once we acknowledge the common scenario with two nested spans in production applications (GenAI => HTTP), it's also worth pointing out that some of the span attributes and metric attributes included in the Semantic Conventions for GenAI would not be used when the GenAI span is of kind
INTERNAL
(e.g.server.port,
server.address
,error.type
), because they would belong to the child HTTP span.Describe alternatives you've considered
No response
Additional context
I mainly work with Java. The two main Java frameworks (Spring Boot and Quarkus) both come with existing instrumentation and observability for the HTTP infrastructure, which is the foundation on top of which libraries like Spring AI and Quarkus LangChain4j are built.
I'm working on instrumenting Spring AI based on the Semantic Conventions for GenAI. Currently, the GenAI spans are of kind
INTERNAL
and the HTTP spans are of kindCLIENT
. Example of trace for a GenAI library that uses an already instrumented (and used in production) HTTP library:The text was updated successfully, but these errors were encountered: