Revisit header name -- Server-Timing vs. traceresponse #556

Open
jpkrohling opened this issue Jan 10, 2024 · 10 comments

Comments

@jpkrohling

Related to #69, I would like to reopen the discussion around the header name for the client propagation of server tracing information.

Practitioners today generally use the Server-Timing header, which browsers already include in their safe-lists. A new header, such as traceresponse, would take considerable effort to get onto those lists and a long time to become ubiquitous among client devices.

The linked issue was closed stating that it was decided against using Server-Timing, but without giving a reason for that. As I mentioned on that issue, by looking at the minutes, I could guess that the reason is related to this comment:

Yoav: This should be opt in, with the bare minimum number of resources that you need.

If that's the concern, isn't the server side already opting in by adding the response metric to this header? Like:

Server-Timing: traceresponse;desc=00-{trace-id}-{child-id}-01

If there's no other reason, I would like to propose a change to the current draft, so that the traceresponse isn't a header, but a metric of the Server-Timing header. This way, we can co-exist with other competing standards and offer a lower-friction migration path to people using Server-Timing today.
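As a minimal sketch of the proposal above, the header value could be assembled on the server like this. The function name `build_server_timing` is hypothetical, and the `00` version prefix and `01`/`00` flag values simply mirror the example in this comment; nothing here is taken from a published spec for this metric.

```python
import re

# Field shapes borrowed from W3C Trace Context: 32 hex chars for the trace id,
# 16 hex chars for the span (child) id.
_TRACE_ID = re.compile(r"[0-9a-f]{32}")
_SPAN_ID = re.compile(r"[0-9a-f]{16}")


def build_server_timing(trace_id: str, child_id: str, sampled: bool = True) -> str:
    """Return a Server-Timing value that carries the trace context as a metric.

    Hypothetical helper: the metric name "traceresponse" and the desc layout
    follow the example in this issue.
    """
    if not _TRACE_ID.fullmatch(trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    if not _SPAN_ID.fullmatch(child_id):
        raise ValueError("child_id must be 16 lowercase hex characters")
    flags = "01" if sampled else "00"
    # Hex digits and "-" are valid token characters, so desc needs no quoting.
    return f"traceresponse;desc=00-{trace_id}-{child_id}-{flags}"


if __name__ == "__main__":
    print("Server-Timing:", build_server_timing(
        "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"))
```

Because the value is an ordinary Server-Timing metric, it can sit alongside existing metrics in the same header, separated by commas.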

References:

@basti1302
Contributor

@dyladan
Member

dyladan commented Jan 30, 2024

@jpkrohling thanks for bringing this back up. I've reached out to a few people who were involved in the initial discussion to see if they can provide more context as to why server-timing was rejected before, since nobody currently involved in this group was around at that time. We also discussed this in the meeting last week. Here is a quick summary of what we discussed:

  • This is the solution already implemented by many modern APM vendors
  • Already implemented in many modern browsers https://caniuse.com/server-timing
    • roughly 75% of users tracked by caniuse.com
    • no iOS support
    • insufficient safari desktop support (only available to network inspector, not JS API)
  • In 2018, when #69 ("Investigate Server-Timing for passing back response headers") was closed, support was in Chrome only and was behind an experiment flag
  • server-timing is limited to same-origin policy unless specified otherwise using timing-allow-origin
    • even if we define our own header, it is likely we would have similar restrictions in order to appease browser vendors
  • server-timing is restricted to secure contexts
  • The use seems to go against the intended use case for server-timing. trace ids are not a "timing" or a "metric"
  • server-timing is still a draft spec, and we are not sure if it is ok for us to build another spec on top of it while it is draft

Regarding the last 2 items, we intend to discuss these with the server-timing specification editors to see if they are actually a problem or not.
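To illustrate the same-origin point in the summary above: a server that wants cross-origin pages to read its Server-Timing entries has to opt in with Timing-Allow-Origin. The sketch below only builds a header dictionary; the helper name `timing_headers` and the example origin are hypothetical.

```python
from typing import Dict, List


def timing_headers(server_timing: str, allowed_origins: List[str]) -> Dict[str, str]:
    """Build response headers that expose Server-Timing to selected origins.

    With an empty allow-list, only same-origin pages can read the entries.
    """
    headers = {"Server-Timing": server_timing}
    if allowed_origins:
        # Timing-Allow-Origin takes "*" or a comma-separated list of origins.
        headers["Timing-Allow-Origin"] = ", ".join(allowed_origins)
    return headers


if __name__ == "__main__":
    value = "traceresponse;desc=00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    for name, val in timing_headers(value, ["https://app.example.com"]).items():
        print(f"{name}: {val}")
```

As noted later in the thread, a user agent may still decline to expose cross-origin entries even when this opt-in is present.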

@dyladan
Member

dyladan commented Jan 30, 2024

Met with Yoav from the web performance group today about this. Notes from the meeting:

  • There has not been a strong demand from the community for the specification to become stable. If it is a concern, we can push forward on it. The person most responsible for driving the spec has moved on.
  • Sergey: stability helped with discussions with .NET for trace context
  • Bastian: if we want to become stable, we would want to rely on stable specifications (at least CR)
  • Yoav: likely there would not be strong objections
  • There has not been wide review for the spec yet
  • Sergey: would the web performance group have any specific objection to using server-timing for server use cases
  • Yoav: the header is optimized for timing metrics, but has been used for other things quite a bit. There is no specific objection to that. If there is a limitation, the API can be expanded.
  • Yoav: server to server seems perfectly fine and unrelated to the web API other than IANA registration
  • Sergey: Are there any well known non-timing use cases?
  • Yoav: not aware of that but it is possible
  • Dan: is there anything we need to be aware of for browser support? iOS and safari support is not yet at the point where it is useable
  • Yoav: the WebKit implementation has been behind a flag since 2018. See server-timing#89, "Privacy and Security section should mention that a user agent may choose to not expose cross-origin PerformanceServerTiming entries even with TAO". WebKit is concerned about Server-Timing across origins: they wanted to block it from cross-origin responses even when the server uses the timing-allow-origin header or CORS opt-in, and a recent version of the spec allows this exemption. They also want to block resource timing across origins. They have said they would enable it for same-origin. Browser implementations differ slightly because Firefox supports trailers and Chrome does not.
  • Merging multi-headers and trailers is something we need to consider. Should not be a problem as long as the metric names are different among the different headers/trailers
  • Related: w3c/server-timing issue #77, "Semantics: Server-Timing is too limited in scope, rename to Server-Metrics"
  • PLH: is it possible for us to reserve a metric name for our specific use?
  • Yoav: open to it, but don’t see a strong need
  • PLH: how can we discover which metric names are already in use?
  • Yoav: suggests HTTP archive as one data source to look for conflicts
  • Sergey: Is there a recommendation for how proxies and load balancers should handle the server-timing header?
  • Yoav: there are some open issues but nothing resolved
  • Kalyana: how difficult was it to convince browser to implement?
  • Yoav: did this when at Akamai. At that time it was specified in Chromium and WebKit, and Firefox followed suit. This was pushed for by the CDN and browser delivery ecosystem
  • Kalyana: from w3c perspective, how much pushback would you expect from w3c review on piggy-backing on server-timing?
  • PLH: if something comes up, it would be with the server-timing header itself not with our use of it
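The note above about merging multiple headers and trailers can be sketched as follows. This is a simplified parser, not a full implementation of the Server-Timing grammar: it assumes quoted parameter values contain no commas or semicolons, and the function names are hypothetical. On a metric-name collision it keeps the first occurrence and reports the name, which is one possible policy among several.

```python
from typing import Dict, Iterable, List, Tuple

Metrics = Dict[str, Dict[str, str]]


def parse_server_timing(value: str) -> Metrics:
    """Parse one Server-Timing field value into {metric name: params}."""
    metrics: Metrics = {}
    for entry in value.split(","):
        parts = [p.strip() for p in entry.split(";")]
        name = parts[0]
        if not name:
            continue  # skip empty list members
        params: Dict[str, str] = {}
        for part in parts[1:]:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
        metrics[name] = params
    return metrics


def merge_server_timing(values: Iterable[str]) -> Tuple[Metrics, List[str]]:
    """Merge several header/trailer values; return (merged, colliding names)."""
    merged: Metrics = {}
    collisions: List[str] = []
    for value in values:
        for name, params in parse_server_timing(value).items():
            if name in merged:
                collisions.append(name)  # first occurrence wins in this sketch
            else:
                merged[name] = params
    return merged, collisions


if __name__ == "__main__":
    header = "db;dur=53, traceresponse;desc=00-aa-bb-01"
    trailer = 'cache;desc="hit", db;dur=10'
    print(merge_server_timing([header, trailer]))
```

With distinct metric names across headers and trailers, the collision list stays empty, which matches the expectation recorded in the notes.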

@kalyanaj
Contributor

kalyanaj commented Jan 31, 2024

Thanks @dyladan for the above notes and summary. As discussed in the W3C DT working group meeting, can we discuss the evaluation criteria (ideally, we should rank them in terms of the most important to least important) and then score these options for those criteria? Please let me know if you prefer a different approach.

Here's my initial attempt at the list of ranked criteria & how these two options (Update: adding a third option for discussion) meet those criteria. Please feel free to edit the below contents directly so that we can collaboratively close on this list:

  • [Must be standards based] The mechanism must be (or have a path to be) an official W3C standard.
  • [Trace context propagation from callees to callers] The mechanism must enable propagating traceid, callee span id, flags (sampled flag, random traceid flag, any other future flags) from callees to callers.
  • [Supported by browsers] The mechanism must have wide support across browser implementations, so that the above trace context information can be used for browser-related DT use cases (e.g., DT for file load).
  • [Supported for server to server use cases] The mechanism must support use for server (callee) to server (caller) trace context propagation.
  • [Must be extensible in the future] The mechanism must be extensible to support future needs in a backwards compatible manner (e.g., using a version field).
  • [Semantically clean] The mechanism must cleanly fit with the trace context semantics.
  • [Reasonably simple to implement] The mechanism must not be unduly complex for implementations.

Are we missing any other major criteria for decision making? Should we add the ones about the proxies/loadbalancers handling?
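To make the propagation and extensibility criteria concrete, here is a sketch of how a caller might decode the value into trace id, callee span id, and flags. It assumes the four-field layout from the example earlier in this thread with all fields present, and borrows the flag bits (0x01 sampled, 0x02 random trace id) and the version rules from W3C Trace Context. The names `TraceResponse` and `parse_trace_response` are hypothetical.

```python
import re
from typing import NamedTuple

_HEX = re.compile(r"[0-9a-f]+")

SAMPLED = 0x01          # trace-flags bit 0
RANDOM_TRACE_ID = 0x02  # trace-flags bit 1 (Trace Context Level 2)


class TraceResponse(NamedTuple):
    version: int
    trace_id: str
    child_id: str
    sampled: bool
    random_trace_id: bool


def parse_trace_response(value: str) -> TraceResponse:
    """Decode "version-traceid-childid-flags" into its parts."""
    fields = value.strip().split("-")
    if len(fields) < 4:
        raise ValueError("expected version-traceid-childid-flags")
    version, trace_id, child_id, flags = fields[:4]
    for field, size in ((version, 2), (trace_id, 32), (child_id, 16), (flags, 2)):
        if len(field) != size or not _HEX.fullmatch(field):
            raise ValueError(f"malformed field: {field!r}")
    if version == "ff":
        raise ValueError("version ff is not allowed")
    if version == "00" and len(fields) != 4:
        raise ValueError("version 00 has exactly four fields")
    # Later versions may append fields; this sketch reads the known prefix only.
    bits = int(flags, 16)
    return TraceResponse(int(version, 16), trace_id, child_id,
                         bool(bits & SAMPLED), bool(bits & RANDOM_TRACE_ID))


if __name__ == "__main__":
    print(parse_trace_response(
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-03"))
```

Reading only the known prefix for unknown versions is what lets the version field provide backwards-compatible extensibility: an older parser still recovers the trace id and flags from a newer value.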

Here's an attempt at scoring these options against the above criteria.

| Criteria | Traceresponse header | Using server-timing header | USE BOTH!: Traceresponse header for the most part + use server-timing only for initial page load by browsers |
|---|---|---|---|
| Must be standards based | Yes (there's a path) | Yes (there's a path) | Yes (there's a path) |
| Trace context propagation from callees to callers | Yes | Yes | Yes |
| Supported by browsers | No (complex to gain adoption) | Yes | Yes |
| Supported for server-server | Yes | Yes | Yes |
| Must be extensible in the future | Yes | Yes | Yes |
| Semantically clean | Yes | No (arguable) | Yes |
| Reasonably simple to implement | Yes | Yes | TBD |

Thoughts? Please feel free to edit directly the list & table above.

@kalyanaj
Contributor

kalyanaj commented Jan 31, 2024

Also, I am looking to understand better:

  • the use cases where browser support is needed for traceresponse header, and...
  • if those necessitate ranking this criterion higher than other criteria (such as all non-browser scenarios & semantic cleanliness).

Is it the file load scenario: where a call is made to download a file and a trace id is returned as part of the response and the browser needs to continue that trace id for the remaining work?

What are the other interesting use cases? Looking to learn more to improve my understanding of the browser side DT / traceresponse use cases.

@yurishkuro
Member

I have a somewhat made-up scenario (from a demo app) - the UI reads the traceID from the response header and uses it to display a link to the trace for the previous action. https://github.com/jaegertracing/jaeger/blob/e08f576fd64a992ef0396112bc8401472cc9dd92/examples/hotrod/services/frontend/web_assets/index.html#L109
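The scenario above, reduced to a sketch: read the trace id from the response headers and turn it into a link. The linked demo does this in browser JavaScript; the version below is a language-neutral illustration in Python. The base URL, the function names, and the choice to check a `traceresponse` header first and then a `traceresponse` metric inside Server-Timing are all assumptions for illustration.

```python
from typing import Mapping, Optional

# Hypothetical address of a tracing UI; replace with a real deployment.
TRACE_UI_BASE = "https://tracing.example.com/trace/"


def extract_trace_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the trace id carried in the response headers, if any."""
    lowered = {k.lower(): v for k, v in headers.items()}
    candidate = lowered.get("traceresponse")
    if candidate is None:
        # Fall back to a "traceresponse" metric inside Server-Timing.
        for entry in lowered.get("server-timing", "").split(","):
            name, _, params = entry.strip().partition(";")
            if name.strip() != "traceresponse":
                continue
            for param in params.split(";"):
                key, _, val = param.partition("=")
                if key.strip().lower() == "desc":
                    candidate = val.strip().strip('"')
    if candidate is None:
        return None
    fields = candidate.split("-")
    # Layout assumed: version-traceid-childid-flags.
    return fields[1] if len(fields) >= 4 and fields[1] else None


def trace_link(headers: Mapping[str, str]) -> Optional[str]:
    """Build a link to the trace for the request that produced these headers."""
    trace_id = extract_trace_id(headers)
    return TRACE_UI_BASE + trace_id if trace_id else None


if __name__ == "__main__":
    print(trace_link({"Server-Timing":
        "db;dur=12, traceresponse;desc=00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}))
```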

@kalyanaj
Contributor

kalyanaj commented Feb 1, 2024

Added a third option to the above table (keep Traceresponse header but use server-timing header only when returning to browsers) for discussion.

This is based on the assumptions that:

  • a new header (traceresponse) may not be pragmatic for the initial page load use cases.
  • however, for other requests (within CORS rules), any headers (including traceresponse) can be sent/received.

If the above assumptions are true (I could be wrong here - not a browser expert) & if there's a way to disambiguate initial page load, then this option may be worth discussing. Including this option to avoid narrow framing and to widen our options for discussion.
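One possible way to disambiguate the initial page load, offered only as an assumption and not something discussed in this thread, is the Fetch Metadata request headers: browsers that support them send `Sec-Fetch-Mode: navigate` and `Sec-Fetch-Dest: document` on top-level navigations. A sketch of the third option under that assumption, with a hypothetical function name:

```python
from typing import Dict, Mapping


def response_trace_headers(request_headers: Mapping[str, str],
                           trace_value: str) -> Dict[str, str]:
    """Pick the response header that carries the trace context.

    Assumption: a top-level page load is recognised by Fetch Metadata headers.
    Clients that do not send them (older browsers, server-to-server callers)
    fall through to the dedicated traceresponse header.
    """
    lowered = {k.lower(): v.strip().lower() for k, v in request_headers.items()}
    is_page_load = (lowered.get("sec-fetch-mode") == "navigate"
                    and lowered.get("sec-fetch-dest") == "document")
    if is_page_load:
        return {"Server-Timing": f"traceresponse;desc={trace_value}"}
    return {"traceresponse": trace_value}


if __name__ == "__main__":
    value = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    print(response_trace_headers(
        {"Sec-Fetch-Mode": "navigate", "Sec-Fetch-Dest": "document"}, value))
    print(response_trace_headers({}, value))
```

The trade-off is that the server then maintains two code paths, which is why the table marks implementation simplicity for this option as TBD.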

@jpkrohling
Author

@kalyanaj, regarding the use cases where browser support is needed, I believe that @cedricziel can elaborate on that, but here's some more information and context: open-telemetry/opentelemetry-specification#3811 (comment)

It's basically the same case that @yurishkuro mentioned before, the only change being that browser-based telemetry tools (like Grafana Faro) can use this header to create span links between frontend and backend traces.

@basti1302
Contributor

Also, I am looking to understand better: the use cases where browser support is needed for traceresponse header, and...

Another use case is a customer support scenario. When a page load fails, having the trace context of that failed page load available in the browser enables showing the trace context on the error page or in automatic ticket creation. Customer support folks can then use the trace id to check the observability tooling to get more information about the failure.

But I believe being able to link the initial page load to the server side trace in the data the client side instrumentation sends to the observability backend is the most relevant use case.

That plus what Ben says here: open-telemetry/opentelemetry-specification#3811 (comment) -- even for requests other than the initial page load (XHR/fetch), using a custom header creates same-origin policy issues. (Yoav pointed out that cross-origin might still be an issue in Safari with Server-Timing, but in general the situation with respect to cross-origin is already a lot better with Server-Timing compared to custom headers.)

@gredler

gredler commented Nov 4, 2024

Are there any updates on this topic? I see that the Level 3 draft published to the website still uses traceresponse. However, migration sounded pretty likely here: open-telemetry/opentelemetry-specification#3811 (comment)
