Produce high-level telemetry #2012

Closed
3 of 6 tasks
Nealm03 opened this issue Apr 2, 2024 · 1 comment

Nealm03 commented Apr 2, 2024

Feature description

One of the keys to understanding and debugging production code is identifying the sources of issues, from errors to latency. It would be great if the "key" steps of PostGraphile's execution model could be exposed as out-of-the-box metrics that can be switched on and ingested by popular metrics stacks, e.g. OpenTelemetry or Prometheus.

Motivating example

  • We have noticed high latency in some requests - yet the database reports low utilisation and relatively quick response times. We'd like to identify where our bottleneck is.

Ideally there are some "significant events" that happen in the lifecycle of a request which we can measure and understand better. Perhaps:

  • Planning (internal vs plugins): exposing the relative latency that custom plugins add during the planning phase, as well as the significant planning steps in the request-processing pipeline. This would help engineers track down slowness introduced by custom functionality, or simply understand the planning model better and spot usage patterns that are not ideal.
  • Execution (I/O latency): exposing the async steps that reach out to the database, and perhaps custom resolver steps, would be very helpful in identifying where things might be going slowly. This would help engineers determine whether connection pooling is misconfigured or whether there is general networking overhead.
  • Response (might be considered part of the former): exposing response mapping and validation timing would be useful for correlating slowness with large requests. Anecdotally, response validation has often caused performance degradation in my experience, and is largely symptomatic of pathological requests.
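To make the three phases above concrete, here is a minimal sketch of per-phase timing. It is an illustration only: `PhaseTimer`, `Phase`, and `handleRequest` are hypothetical names, not PostGraphile identifiers, and a real integration would forward the durations to OpenTelemetry or Prometheus rather than keep them in memory.

```typescript
// Hypothetical phase names taken from the list above; not PostGraphile identifiers.
type Phase = "planning" | "execution" | "response";

interface PhaseStats {
  count: number;
  totalMs: number;
  maxMs: number;
}

class PhaseTimer {
  private readonly stats = new Map<Phase, PhaseStats>();
  private readonly now: () => number;

  // The clock is injectable so tests can supply deterministic timestamps.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Record one observed duration for a phase.
  record(phase: Phase, ms: number): void {
    const s = this.stats.get(phase) ?? { count: 0, totalMs: 0, maxMs: 0 };
    s.count += 1;
    s.totalMs += ms;
    s.maxMs = Math.max(s.maxMs, ms);
    this.stats.set(phase, s);
  }

  // Time an async unit of work, recording its duration even if it throws.
  async time<T>(phase: Phase, work: () => Promise<T>): Promise<T> {
    const start = this.now();
    try {
      return await work();
    } finally {
      this.record(phase, this.now() - start);
    }
  }

  get(phase: Phase): PhaseStats | undefined {
    return this.stats.get(phase);
  }
}

// Usage: wrap each stage of a (stand-in) request handler.
async function handleRequest(timer: PhaseTimer): Promise<string> {
  const plan = await timer.time("planning", async () => "plan");
  const rows = await timer.time("execution", async () => [plan]);
  return timer.time("response", async () => JSON.stringify(rows));
}
```

Keeping a separate total and maximum per phase is what would let the motivating case be diagnosed: if "execution" stays small while "planning" or "response" dominates, the database is not the bottleneck.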

Supporting development

  • am interested in building this feature myself
  • am interested in collaborating on building this feature
  • am willing to help testing this feature before it's released
  • am willing to write a test-driven test suite for this feature (before it exists)
  • am a Graphile sponsor ❤️
  • have an active support or consultancy contract with Graphile
@github-project-automation github-project-automation bot moved this to 🌳 Triage in V5.0.0 Apr 2, 2024
@benjie benjie moved this from 🌳 Triage to 🦉 Owl in V5.0.0 Apr 2, 2024
Member

benjie commented May 23, 2024

It is now possible to build this kind of integration via the new "middlewares" system introduced in #2071. We can add more middleware positions over time, but I think we have the key bases you mentioned covered.
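As a rough illustration of why a middleware system is enough for telemetry, the sketch below wraps a unit of work in a chain of middlewares, one of which records a span. It does not use the API introduced in #2071: `Middleware`, `runWithMiddleware`, `telemetryMiddleware`, and `Span` are hypothetical names chosen for this example, and the span sink is a plain array rather than an OpenTelemetry exporter.

```typescript
// Hypothetical middleware shape for illustration; not the API from #2071.
type Next<T> = () => Promise<T>;
type Middleware<T> = (event: string, next: Next<T>) => Promise<T>;

// Run `core` wrapped by each middleware; the first in the list is outermost.
function runWithMiddleware<T>(
  event: string,
  middlewares: ReadonlyArray<Middleware<T>>,
  core: Next<T>,
): Promise<T> {
  const chain = middlewares.reduceRight<Next<T>>(
    (next, mw) => () => mw(event, next),
    core,
  );
  return chain();
}

interface Span {
  event: string;
  durationMs: number;
  ok: boolean;
}

// A middleware that records how long the wrapped step took and whether it succeeded.
function telemetryMiddleware<T>(
  sink: Span[],
  now: () => number = () => Date.now(),
): Middleware<T> {
  return async (event, next) => {
    const start = now();
    let ok = false;
    try {
      const result = await next();
      ok = true;
      return result;
    } finally {
      // Runs on both success and failure, so errors are measured too.
      sink.push({ event, durationMs: now() - start, ok });
    }
  };
}
```

Under this model, each additional "middleware position" is just another named event that the same telemetry middleware can observe without changes to the core code.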

I've written an early telemetry plugin using the middleware system combined with OpenTelemetry. I've sent you an invite; once accepted, the plugin can be accessed here: https://github.com/graphile-pro/telemetry - please let me know how you get on with your testing!

@benjie benjie closed this as completed May 23, 2024
@github-project-automation github-project-automation bot moved this from 🦉 Owl to ✅ Done in V5.0.0 May 23, 2024
2 participants