A CI pipeline can fail because a build or test step returns a negative result, or because of some error in the CI pipeline logic. We currently don't distinguish these outcomes in most cases. Ignoring this distinction has the following known down-sides:
(1) Users will see job failures, but won't learn that the CI is suffering from internal errors until they inspect the logs.
(2) While we collect metrics on the number of failed CI jobs, we cannot differentiate these from errors that would indicate sporadic or pervasive failures in the infrastructure and services.
(3) We have no way of restarting jobs that failed due to an infrastructure error after that error has been repaired.
(2) bit us last week, when we failed to detect ocaml/infrastructure#128 until it was so widespread that users were noticing the failures. If we had monitoring that alerted us to errors in the pipeline, we could have seen this coming much earlier. Incorporating an error status into the metrics sent to Grafana would make these failures clearly visible.
To address (1), we can send an error status to GitHub. The current API supports the following statuses:
error, failure, pending, success
But, iiuc, we don't use the error status in our reporting:
opam-repo-ci/service/github.ml
Lines 21 to 23 in 97d42b7

I propose to rename "error" to "internal error" or "internal CI error". And once this build status has been reached, maybe an automatic restart of the job should be scheduled (with an exponential backoff, of course, to avoid the CI going crazy)? The same could be done for "ocaml-ci".
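Purely as an illustration of the idea (this is not the actual code in service/github.ml, and all names below are made up), the distinction and the proposed retry backoff could look something like this:

```ocaml
(* Hypothetical sketch: distinguish genuine build/test failures from internal
   CI errors when reporting a commit status to GitHub. *)
type outcome =
  | Passed                    (* build/test succeeded *)
  | Failed of string          (* the package genuinely failed to build or test *)
  | Internal_error of string  (* the CI itself broke: infrastructure, disk, ... *)

(* GitHub's commit-status API accepts: error, failure, pending, success. *)
let to_github_state = function
  | Passed -> "success"
  | Failed _ -> "failure"
  | Internal_error _ -> "error"

let to_github_description = function
  | Passed -> "Build succeeded"
  | Failed msg -> "Build failed: " ^ msg
  | Internal_error msg -> "Internal CI error (not the package's fault): " ^ msg

(* Capped exponential backoff for automatically restarting jobs that hit an
   internal error, so the CI doesn't retry out of control. *)
let retry_delay ~attempt =
  let base = 60. (* seconds *) in
  min (base *. (2. ** float_of_int attempt)) 3600.
```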
To address (2), we should send error results to Grafana:
opam-repo-ci/service/metrics.ml
Lines 34 to 37 in 97d42b7
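As a sketch of what "sending error results" means (the real code in service/metrics.ml presumably exposes Prometheus metrics that Grafana reads; I'm not reproducing that API here, just the extra state being counted):

```ocaml
(* Hypothetical sketch: count job results per state, so that an "error"
   series shows up in Grafana next to "ok" and "failed". *)
type state = Passed | Failed | Internal_error

let string_of_state = function
  | Passed -> "ok"
  | Failed -> "failed"
  | Internal_error -> "error"

(* In the real service this would be a labelled Prometheus gauge or counter;
   a hash table stands in for it here. *)
let counts : (string, int) Hashtbl.t = Hashtbl.create 3

let record state =
  let key = string_of_state state in
  let n = Option.value ~default:0 (Hashtbl.find_opt counts key) in
  Hashtbl.replace counts key (n + 1)
```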
To address (3), we start by recording failures in the job index, which doesn't currently differentiate failures from errors:
opam-repo-ci/lib/index.ml
Line 114 in 97d42b7
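A rough sketch of what the index could record instead of a plain pass/fail bit, so that jobs which hit an internal error can later be found and rescheduled once the underlying problem is fixed (again, the names are hypothetical, not what lib/index.ml actually uses):

```ocaml
(* Hypothetical sketch of a job index that keeps failures and internal
   errors apart. *)
type outcome =
  | Passed
  | Failed of string          (* genuine build/test failure *)
  | Internal_error of string  (* CI infrastructure error *)

type entry = { job_id : string; outcome : outcome }

let index : (string, entry) Hashtbl.t = Hashtbl.create 64

let record ~job_id outcome =
  Hashtbl.replace index job_id { job_id; outcome }

(* Jobs that should be restarted after an infrastructure outage is repaired. *)
let jobs_to_restart () =
  Hashtbl.fold
    (fun _ e acc ->
       match e.outcome with Internal_error _ -> e.job_id :: acc | _ -> acc)
    index []
```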