Feature description
A key part of understanding and debugging production code is identifying the sources of issues, from errors to latency. It would be great if the "key" steps of PostGraphile's execution model could be exposed as out-of-the-box metrics that can be switched on and ingested by popular metrics stacks, e.g. OpenTelemetry or Prometheus.
Motivating example
We have noticed high latency in some requests - yet the database reports low utilisation and relatively quick response times. We'd like to identify where our bottleneck is.
Ideally there are some "significant events" that happen in the lifecycle of a request which we can measure and understand better. Perhaps:
Planning (internal vs plugins): expose the latency that custom plugins add during the planning phase, relative to internal planning, as well as the significant planning steps in the request pipeline. This would help engineers track down slowness introduced by custom functionality, or simply understand the planning model better and spot usage patterns that are not ideal.
Execution (I/O latency): expose the async steps that reach out to the database, and perhaps custom resolver steps, to help identify where things are going slow. This would help engineers determine whether the cause is misconfigured connection pooling or general networking overhead.
Response (might be considered part of the former): expose response mapping and validation timing so that it can be correlated with large requests. Anecdotally, response validation has often caused performance degradation in my experience, and is largely symptomatic of pathological requests.
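To make the three phases above concrete, here is a minimal sketch of the kind of per-phase timing data that could be collected. Everything in it is an assumption for illustration: `Phase`, `PhaseSample` and `PhaseRecorder` are hypothetical names, not PostGraphile identifiers, and the labels stand in for whatever plugin or step names the real pipeline would report.

```typescript
// Hypothetical phase names taken from the list above; not PostGraphile identifiers.
type Phase = "planning" | "execution" | "response";

interface PhaseSample {
  phase: Phase;
  label: string; // illustrative: a plugin name or step name
  durationMs: number;
  ok: boolean; // false if the measured function threw or rejected
}

class PhaseRecorder {
  readonly samples: PhaseSample[] = [];
  private readonly now: () => number;

  // The clock is injectable so the recorder can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Runs `fn`, recording how long it took and whether it succeeded.
  async measure<T>(
    phase: Phase,
    label: string,
    fn: () => T | Promise<T>,
  ): Promise<T> {
    const start = this.now();
    let ok = false;
    try {
      const result = await fn();
      ok = true;
      return result;
    } finally {
      this.samples.push({ phase, label, durationMs: this.now() - start, ok });
    }
  }

  // Sums the recorded durations per phase.
  totals(): Record<Phase, number> {
    const totals: Record<Phase, number> = {
      planning: 0,
      execution: 0,
      response: 0,
    };
    for (const sample of this.samples) {
      totals[sample.phase] += sample.durationMs;
    }
    return totals;
  }
}
```

Splitting samples by phase and label is what would let the "internal vs plugins" comparison be made: a slow request with a large `planning` total but a small `execution` total would point away from the database.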
Supporting development
I am interested in building this feature myself
I am interested in collaborating on building this feature
I am willing to help testing this feature before it's released
I am willing to write a test-driven test suite for this feature (before it exists)
It is now possible to build this kind of integration via the new "middlewares" system introduced in #2071. We can add more middleware positions over time, but I think we have the key bases you mentioned covered.
I've written an early telemetry plugin using the middleware system combined with OpenTelemetry. I've sent you an invite; once you accept it, you can access the plugin here: https://github.com/graphile-pro/telemetry - please let me know how you get on with your testing!
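For readers without access to that repository, the general idea of timing via middleware can be sketched as below. This is not the API of the middleware system from #2071 and not the code of the telemetry plugin: the `Middleware` shape, `compose`, `timing` and the `Report` sink are all hypothetical, chosen only to show how a wrapper around `next()` can measure whatever runs downstream of it.

```typescript
// Generic "next"-style middleware. This shape is an assumption for
// illustration, not the signature used by the real middleware system.
type Next<T> = () => Promise<T>;
type Middleware<E, T> = (next: Next<T>, event: E) => Promise<T>;

// Composes middlewares so the first in the list is the outermost wrapper.
function compose<E, T>(
  middlewares: Middleware<E, T>[],
  event: E,
  core: Next<T>,
): Promise<T> {
  const run = middlewares.reduceRight<Next<T>>(
    (next, middleware) => () => middleware(next, event),
    core,
  );
  return run();
}

// Hypothetical sink; a real integration would forward these values to
// OpenTelemetry or Prometheus instead.
type Report = (name: string, durationMs: number, failed: boolean) => void;

// Builds a middleware that reports how long everything downstream took.
function timing<E, T>(
  name: string,
  report: Report,
  now: () => number = () => Date.now(),
): Middleware<E, T> {
  return async (next) => {
    const start = now();
    let failed = true;
    try {
      const result = await next();
      failed = false;
      return result;
    } finally {
      // Runs on both success and failure, so errors are timed too.
      report(name, now() - start, failed);
    }
  };
}
```

Because the measurement lives in a `finally` block, failed requests are reported with their duration as well, which matters when errors and latency need to be correlated.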