Produce high-level telemetry #2012

Nealm03 · 2024-04-02T15:49:59Z

Feature description

A key part of understanding and debugging production code is identifying the sources of issues, from errors to latency. It would be great if the "key" steps of PostGraphile's execution model could be exposed as out-of-the-box metrics that can be switched on and ingested by popular metric stacks, e.g. OpenTelemetry or Prometheus.

Motivating example

We have noticed high latency in some requests, yet the database reports low utilisation and relatively quick response times. We'd like to identify where our bottleneck is.

Ideally there are some "significant events" that happen in the lifecycle of a request which we can measure and understand better. Perhaps:

Planning (internal vs plugins): exposing the relative latency that custom plugins add in the planning phase, as well as the significant planning steps in the request-processing pipeline. This would help engineers track down slowness introduced by custom functionality, or simply better understand the planning model and where usage patterns are not ideal.
Execution (I/O latency): exposing the async steps that reach out to the database, and perhaps custom resolver steps, would be very helpful in identifying where things might be going slow. This would help engineers identify whether there's a misconfiguration in the connection pooling or general networking overhead.
Response (might be considered part of the former): exposing response mapping and validation timing would be useful for correlating with large requests. Anecdotally, response validation has often caused performance degradation in my experience, and is largely symptomatic of pathological requests.
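To make the three phases above concrete, here is a minimal sketch of the kind of per-phase timing being asked for. Everything in it (`Phase`, `PhaseRecorder`, the labels) is hypothetical and invented for illustration; none of it is PostGraphile or Grafast API, and a real integration would hand the samples to an exporter such as OpenTelemetry or Prometheus instead of keeping them in memory.

```typescript
// Hypothetical sketch: these names are illustrative, not PostGraphile API.
type Phase = "planning" | "execution" | "response";

interface PhaseSample {
  phase: Phase;
  label: string; // e.g. a plugin name or a step description
  durationMs: number;
}

class PhaseRecorder {
  readonly samples: PhaseSample[] = [];
  private readonly now: () => number;

  // The clock is injectable so a higher-resolution timer (or a fake clock
  // in tests) can be swapped in; Date.now has millisecond resolution.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Time a synchronous piece of work, e.g. a planning step.
  time<T>(phase: Phase, label: string, work: () => T): T {
    const start = this.now();
    try {
      return work();
    } finally {
      // Recorded even if the work throws, so failures still show up.
      this.samples.push({ phase, label, durationMs: this.now() - start });
    }
  }

  // Time asynchronous work, e.g. a database round trip.
  async timeAsync<T>(
    phase: Phase,
    label: string,
    work: () => Promise<T>,
  ): Promise<T> {
    const start = this.now();
    try {
      return await work();
    } finally {
      this.samples.push({ phase, label, durationMs: this.now() - start });
    }
  }

  // Sum of recorded durations per phase.
  totals(): Record<Phase, number> {
    const totals: Record<Phase, number> = {
      planning: 0,
      execution: 0,
      response: 0,
    };
    for (const s of this.samples) totals[s.phase] += s.durationMs;
    return totals;
  }
}
```

Comparing `totals().execution` against the database's own reported query time would be one way to see whether the extra latency sits in pooling or networking, or elsewhere in planning or response handling.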

Supporting development

am interested in building this feature myself
am interested in collaborating on building this feature
am willing to help testing this feature before it's released
am willing to write a test-driven test suite for this feature (before it exists)
am a Graphile sponsor ❤️
have an active support or consultancy contract with Graphile

benjie · 2024-05-23T17:09:44Z

It is now possible to build this kind of integration via the new "middlewares" system introduced in #2071. We can add more middleware positions over time, but I think we have the key bases you mentioned covered.

I've written an early telemetry plugin using the middleware system combined with OpenTelemetry. I've sent you an invite; once accepted, the plugin can be accessed here: https://github.com/graphile-pro/telemetry - please let me know how you get on with your testing!
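For readers unfamiliar with the pattern, the general idea of timing a step through middleware can be sketched as below. This is a generic illustration only: it does not reproduce the middleware API from #2071 or the telemetry plugin, and `runWithMiddleware`, `timingMiddleware`, and the `report` callback are hypothetical names.

```typescript
// Generic middleware sketch; NOT the Graphile middleware API.
type Next<T> = () => T;
type Middleware<E, T> = (next: Next<T>, event: E) => T;

// Wrap `action` in the given middlewares; the first one is outermost.
function runWithMiddleware<E, T>(
  middlewares: Middleware<E, T>[],
  event: E,
  action: (event: E) => T,
): T {
  const chain = middlewares.reduceRight<Next<T>>(
    (inner, mw) => () => mw(inner, event),
    () => action(event),
  );
  return chain();
}

// Middleware that reports how long everything inside it took.
// `report` is a hypothetical hook; a real plugin might end a span here.
function timingMiddleware<E, T>(
  label: string,
  report: (label: string, durationMs: number) => void,
  now: () => number = Date.now,
): Middleware<E, T> {
  return (next) => {
    const start = now();
    const done = () => report(label, now() - start);
    let result: T;
    try {
      result = next();
    } catch (e) {
      done();
      throw e;
    }
    if (result instanceof Promise) {
      // Async work: report once it settles, passing the outcome through.
      return result.finally(done) as unknown as T;
    }
    done();
    return result;
  };
}
```

Because the timing logic lives in the wrapper, the wrapped action does not need to know it is being measured, and more positions can be instrumented by wrapping more actions.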

github-project-automation bot added this to V5.0.0 Apr 2, 2024

github-project-automation bot moved this to 🌳 Triage in V5.0.0 Apr 2, 2024

benjie moved this from 🌳 Triage to 🦉 Owl in V5.0.0 Apr 2, 2024

benjie closed this as completed May 23, 2024

github-project-automation bot moved this from 🦉 Owl to ✅ Done in V5.0.0 May 23, 2024
