-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FaaS: Change requirements regarding handling of AWS Lambda-provided SpanContext
#272
Conversation
…`SpanContext Signed-off-by: Anthony J Mirabella <[email protected]>
15384ec
to
07435bc
Compare
Please adapt the PR title, since this is more than "clarification" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, I assumed the Link vs. Parent debate was one where the community wishes to move towards linking as the default, and we're providing configuration to aide in the transition for existing x-ray users.
This does not appear to be that direction, nor is this just a clarification of wording. Would like to hear from others in the FAAS group before allowing this to be merged.
[start options](https://github.com/open-telemetry/opentelemetry-specification/tree/v1.22.0/specification/trace/api.md#specifying-links) with an associated `String`-typed | ||
attribute with a key of `source` and a value of `x-ray-env`, in order to indicate the source of the linked span. | ||
|
||
Instrumentation MUST provide a mechanism for the user to select the behavior that they require it to utilize. In the absence of an expressed user preference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence flips the previous default setting. Is that intended / agreed upon by the FAAS SiG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the current state of most Lambda instrumentation. It is preserving the functioning behavior for users who had functioning instrumentation while clarifying that it is permissible to have an alternate behavior. I feel that this is a clarification as the spec as written permits this behavior and that was not apparent to all in the discussion during the last FaaS SIG meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Aneurysm9 Java and the semconv docs are not inline with this, and we would be indeed flipping the current default in this repo.
As said before: having this as default as direct parent will work great with X-Ray but will break the other vendors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said before: having this as default as direct parent will work great with X-Ray but will break the other vendors.
I do not believe that this is a correct interpretation of the situation. Having the default behavior to be to use the value as a parent SpanContext
will preserve the existing behavior for for current users in all languages but Java. It will not break any users. Users who require a different behavior will be able to select that behavior.
Java and the semconv docs are not inline with this, and we would be indeed flipping the current default in this repo.
Java implemented a spec change that was incomplete and acknowledged to require changes. This is the semconv repo and this change is fixing the "semconv docs". In any event, this is "experimental" specification and subject to change. Is there a special process for such changes I'm unaware of that I should be following?
attribute with a key of `source` and a value of `x-ray-env`, in order to indicate the source of the linked span. | ||
|
||
Instrumentation MUST provide a mechanism for the user to select the behavior that they require it to utilize. In the absence of an expressed user preference | ||
instrumentation SHOULD utilize the extracted `Context` as the parent context for the `Invocation Span`. Instrumentation MAY require that the user explicitly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to allow explicit configuration as a requirement? I'd prefer if we could all agree on a default behavior here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, explicit configuration is a requirement. The Lambda-provided context does different things for different people and they need a mechanism for communicating their intent to the instrumentation.
I believe, as noted above, that this is a clarification of the intent behind the initial attempt to resolve open-telemetry/opentelemetry-specification#3060. If you disagree, please provide alternate titles that would comport with your understanding of this change. |
|
||
**Note**: When instrumenting a Java AWS Lambda, instrumentation SHOULD first try to parse an OpenTelemetry `Context` out of the system property `com.amazonaws.xray.traceHeader` using the [AWS X-Ray Propagator](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/api-propagators.md) before checking and attempting to parse the environment variable above. | ||
* as the parent `Context` | ||
* as a `SpanContext` associated with a [Span Link][] added to the `Invocation Span`'s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this SpanContext is used as a span link instead of the parent, it won't work with ParentBased samplers like the OTel default ParentBased(root=AlwaysOn)
.
Does AWS provide head sampling through this SpanContext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Head sampling can be provided through this SpanContext. It is my understanding that that is irrelevant to the users who wish to have it associated as a link because they are not using X-Ray as their tracing backend and do not desire to have a trace including the Lambda-vended segments. I believe the expectation is that those users will obtain a SpanContext from another source, such as HTTP headers included in the invocation event, and would want to use that SpanContext for any sampling decisions.
@Aneurysm9 Simplest adaption of PR title would be "FaaS: |
… into fix/lambdaSpanContext
Signed-off-by: Anthony J Mirabella <[email protected]>
SpanContext
SpanContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporarily blocking as these changes are actually breaking the current semconv documents, changing once again the primary role of the context in the AWS specific environment variable.
@carlosalberto I don't personally agree with this change, but I also don't agree with blocking it for this reason, and I think we should let discussion continue to play out here and in open-telemetry/opentelemetry-specification#263 @Aneurysm9 my personal objection to this change is that it prioritizes the default experience for one monitoring backend (X-Ray) at the expense of the default experience for other monitoring backends, which doesn't feel in line with OpenTelemetry's vision. (btw I personally support an opt-in behavior to support X-Ray in this case) |
I agree with @trask's objection to this change. While I think there are better solutions I'm not opposed to adding a setting to prefer the x-ray env context over remote context as long as the default is to use the remote. By default, traces reported via OTLP should be complete, regardless of if x-ray is enabled or not. |
From the vision document for clarity:
I believe that there is an expectation inherent in this objection that using the AWS Lambda vended spans as a parent context is only relevant to users of AWS X-Ray. This is not the case. It is possible to utilize the AWS X-Ray trace context propagation format without using AWS X-Ray as a trace backend. It is possible to do so in a way that would cause AWS Lambda to participate in that trace, vending segments to AWS X-Ray that can be obtained by the user and migrated to the trace backend of their choice. There is nothing in this change that would result in any user being locked to any vendor. There is nothing in this change that would preclude interoperability with any other system. The reasons I feel it is appropriate for the default behavior to be to utilize the AWS Lambda vended context as a parent are that 1) this is the current behavior for all AWS Lambda instrumentation with the exception of the Java instrumentation and 2) this is instrumentation of an AWS service and it is reasonable to utilize the service-provided context propagation mechanism as the default. |
@Aneurysm9 the issue here isn't the propagation mechanism. X-Ray propagation is fine and I even submitted spec changes to formalize that further. Our issue from the beginning has always been the forked context that passes through AWS services, creating spans that aren't reported in the same manner (to the same backend) as the rest of the spans within the trace. Such behavior results in broken traces which is unacceptable. |
@Aneurysm9 I'd like to understand this better. Could you link to docs for how e.g. a Jaeger user would set this up? thx |
Now I'm confused. Is the issue the propagation mechanism, or is it not the propagation mechanism? Are broken traces acceptable or not? Is this situation (a distributed trace passes through a system that is not in control of the application owner, resulting in spans reported to a trace backend that is not in the control of the application owner) not common to all distributed tracing backends and instrumentation? This is such a common situation that the specification requires that APIs must accept "[t]he parent I also do not believe that there has ever been a situation where users of tracing backends other than AWS X-Ray have been wholly unable to create complete traces for interactions involving AWS Lambda function invocations. To the extent that their preferred parent context is carried by the invocation event and they have access to a |
(Removing my changes request as per @trask's feedback) |
IIUC - The key user issue we are debating (and the one we're arguing to prioritize) is as follows: Scenario 1
How should we attach traces to give users the best o11y here? A few things I'm assuming:
In this case, the ONLY way to not have a broken trace is to "link" against the AWS x-ray trace vs. parent against it. I believe this SHOULD be the default, including apply this to GCP services that have the same problem. This allows users to define their own "security boundaries" of what propagation they allow without being bound to what we, as cloud providers, require for trace_id generation. Scenario 2There are users who will leverage OOTB trace sampling and generation of their cloud provider. GCP, e.g. provides tracing and sampling of Cloud Run as a free option. Users may want to track their traces at this "entry point" level. This works when a trace is not going through one of these security / trace_id ownership boundaries between services for that user. I agree there should be a way for users to get what they need here. The big questionWhich scenario should be the default / more likely for users. It was my understanding that Scenario 1 would be the new default behavior, with Scenario 2 requiring user configuration, but achievable. A few comments from the linked issue that lead to me raising concerns:
This last one coming from this pr, which was approved and merged. |
The context provided by AWS Lambda in the execution environment is a remote context. The question is which of the remote contexts to use when multiple are present and is complicated by the fact that the specification recommends specific sources of context carriers be used. I have tried to minimize the change to the current specification here to avoid getting into issues of separation of responsibilities between instrumentation author, instrumentation user, instrumentation implementations, context propagators, and user applications. I'm not opposed to having that discussion again as long as we stop pushing changes that will break users for whom the instrumentation is presently functioning.
I'm not sure where OTLP entered the picture here and this is not something that any instrumentation can guarantee precisely because distributed tracing involves interacting with systems outside the control of the instrumented system. Regardless which serialization format and network protocol is used to communicate trace data an incomplete trace will be experienced if one participant in the trace sends data to one backend and another participant sends data to a different backend. I will let the AWS Lambda team handle your product feature request, but I will note here for posterity that it was made five days ago in a repository relevant to container images useful to AWS Lambda users who use the container packaging format for their functions. This is likely not the correct forum for such a product feature request. |
just checking my maths, doesn't the same problem exist even if the same traceId is used, since you still have to pick one of the spanIds to parent? |
@trask That's true. However if the same trace_id is used, I think it's less likely to cause broken traces, but would cause missing spans. Depends on the backend implemetnation at that point. However, we're starting to sidestep the main point. Instead of repeating, I'll link my more succinct comment: #263 (comment) |
@Aneurysm9 I understand that is likely not the most relevant location for such but it was the place that @rapphil recommended I provide such feedback and it seemed like a reasonable place to start such a conversation in public. If you have a better forum for that conversation, let me know and I'll be happy to take that suggestion elsewhere. |
Please do not ascribe "demand"s to me that I have not made and please do not make assertions about my mental state that you cannot know. I have indicated that the suggestion on open-telemetry/opentelemetry-specification#263 does not work for me, for a variety of reasons that have since been elaborated, and I have suggested an alternative. |
For the case where the AWS service in question is a messaging system (SNS, SQS), this maps to scenarios extensively discussed in the messaging SIG, results of this discussion are captured in OTEP 220. TLDR; We always recommend to create a link between Service A and Service B, which denotes a producer/consumer relationship. Depending on variations of the scenarios (batch-processing, or broker instrumentation) there can or cannot be links between A and B, or between A, B, and the broker. However, the link between A and B is an invariant for all scenarios. I'm not familiar with all details of this discussion, so I might be a bit off, however, it seems to me that it is closely related to issue #652. In my opinion, the question of parent vs link is not so much related to propagation as it is related to the question of how traces should look like for the user. |
@tylerbenson While we have had similar troubles with this, it is not really an AWS problem, but a problem that in general will be even more widespread when cloud providers adapt W3C TraceContext headers. Then you won't even be able to switch to W3C propagation as a workaround. You can observe this when trying to monitor google cloud functions. Google, fully W3C conformant, participates in the trace an will thus override your span ID. Multi-tenancy is simply a missing feature from OTel and also seems to be somehow missed in the W3C spec (for a potential solution see open-telemetry/opentelemetry-specification#366 (comment)) |
@Oberon00 I agree and for that reason I suggest we work with cloud providers for easier ways to get access to those internal spans. Last week I submitted an issue on AWS's side: aws/aws-lambda-base-images#111 to see if I could get any reaction from them that doesn't involve just polling their API. |
FYI - TC discussion and opinion on this PR: #277 (comment) |
This change clarifies the requirements regarding the handling of the
SpanContext
vended by the AWS Lambda service to function invocations.