Distinguish failing CI runs from CI errors #328

Open
shonfeder opened this issue Jun 26, 2024 · 1 comment
Labels
capacity+ Work that will increase our capacity (by reducing workload or increasing our efficacy)

Comments

@shonfeder
Contributor

A CI pipeline can fail because a build or test step returns a negative result, or because of an error in the CI pipeline logic itself. We currently don't distinguish these outcomes in most cases. Ignoring this distinction has the following known downsides:

  1. Users see job failures, but don't learn that the CI is suffering from internal errors until they inspect the logs.
  2. While we collect metrics on the number of failed CI jobs, we cannot differentiate these from errors that would indicate sporadic or pervasive failures in the infrastructure and services.
  3. We have no way of restarting jobs that failed due to an infrastructure error after that error has been repaired.

(2) bit us last week, when we failed to detect ocaml/infrastructure#128 until it was so widespread that users were noticing the failures. If we had monitoring that alerted us to errors in the pipeline, we could have seen this coming much earlier. Incorporating an error status into the metrics sent to Grafana would make these failures clearly visible.

To address (1), we can send an error status to GitHub. The current API supports the following statuses:

error, failure, pending, success

But, iiuc, we don't use the error status in our reporting:

| _, Ok m -> Api.Status.v ~url `Success ~description:("Passed - "^m)
| _, Error (`Active _) -> Api.Status.v ~url `Pending
| _, Error (`Msg m) -> Api.Status.v ~url `Failure ~description:("Failed - "^m)
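
As a rough sketch of what using the error status could look like, the standalone snippet below maps a job outcome to a status and description. Everything in it is an assumption for illustration: the `state` and `outcome` types are local stand-ins rather than the real `Api.Status` types, and the `` `Internal_error `` tag is a hypothetical way of marking pipeline errors separately from `` `Msg `` failures.

```ocaml
(* Local stand-in for the four statuses listed above; not the real API type. *)
type state = [ `Success | `Pending | `Failure | `Error ]

(* Hypothetical outcome type: `Internal_error is an assumed extra tag used to
   mark errors in the CI pipeline itself, as opposed to build/test failures. *)
type outcome =
  ( string,
    [ `Active of [ `Ready | `Running ]
    | `Msg of string
    | `Internal_error of string ] )
  result

(* Map an outcome to a status plus an optional description. *)
let status_of_outcome (o : outcome) : state * string option =
  match o with
  | Ok m -> (`Success, Some ("Passed - " ^ m))
  | Error (`Active _) -> (`Pending, None)
  | Error (`Msg m) -> (`Failure, Some ("Failed - " ^ m))
  | Error (`Internal_error m) -> (`Error, Some ("Internal CI error - " ^ m))
```

How internal errors would actually be identified and tagged upstream of this match is left open here.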

To address (2), we should send error results to Grafana:

Gauge.set (master "not_started") (float_of_int n_per_status.not_started);
Gauge.set (master "pending") (float_of_int n_per_status.pending);
Gauge.set (master "failed") (float_of_int n_per_status.failed);
Gauge.set (master "passed") (float_of_int n_per_status.passed);
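
A minimal sketch of the extra metric, assuming the per-status counts gain an `errored` field (a hypothetical name). To stay self-contained it does not call the Prometheus `Gauge` API; it only computes the label/value pairs that would be passed to `Gauge.set`.

```ocaml
(* Hypothetical per-status counts; [errored] is the assumed new field. *)
type n_per_status = {
  not_started : int;
  pending : int;
  failed : int;
  errored : int;
  passed : int;
}

(* One (label, value) pair per gauge, including the new "errored" gauge. *)
let gauge_values (n : n_per_status) : (string * float) list =
  [
    ("not_started", float_of_int n.not_started);
    ("pending", float_of_int n.pending);
    ("failed", float_of_int n.failed);
    ("errored", float_of_int n.errored);
    ("passed", float_of_int n.passed);
  ]
```

With a separate "errored" series, an alert could be set on that gauge alone, independent of ordinary build failures.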

To address (3), we can start by recording errors in the job index, which doesn't currently differentiate failures from errors:

type build_status = [ `Not_started | `Pending | `Failed | `Passed ]
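
One possible shape for this, as a sketch only: the `` `Errored `` variant and the helper names below are assumptions, not existing code. Once errors are recorded separately, the jobs eligible for a restart after an infrastructure fix can be selected directly.

```ocaml
(* Assumed extension of the index status with a distinct `Errored case. *)
type build_status = [ `Not_started | `Pending | `Failed | `Errored | `Passed ]

(* Only jobs that hit an internal error are candidates for a restart;
   genuine build or test failures are left alone. *)
let is_restartable : build_status -> bool = function
  | `Errored -> true
  | `Not_started | `Pending | `Failed | `Passed -> false

(* Given (job id, status) pairs, return the ids worth restarting. *)
let restartable (jobs : (string * build_status) list) : string list =
  List.filter (fun (_, s) -> is_restartable s) jobs |> List.map fst
```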

@shonfeder shonfeder added the capacity+ Work that will increase our capacity (by reducing workload or increasing our efficacy) label Jun 26, 2024
@hannesm
Contributor

hannesm commented Jun 26, 2024

I propose to rename "error" to "internal error" or "internal CI error". And once a job ends up with this build status, maybe an automatic restart of the job should be scheduled (surely with an exponential backoff to avoid CI going crazy)? The same could be done for "ocaml-ci".
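
A minimal sketch of such a backoff schedule, under made-up parameters: the base delay, cap, and retry limit are illustrative defaults, and the function names are hypothetical.

```ocaml
(* Delay in seconds before retry number [attempt] (0-based): doubles on each
   attempt and is capped at [max_delay]. Defaults are illustrative only. *)
let backoff_delay ?(base = 60.0) ?(max_delay = 3600.0) attempt =
  Float.min max_delay (base *. (2.0 ** float_of_int attempt))

(* Give up after [max_attempts] restarts; [None] means no further retry. *)
let next_retry ?(max_attempts = 5) attempt =
  if attempt >= max_attempts then None else Some (backoff_delay attempt)
```

Capping both the delay and the number of attempts keeps a persistent infrastructure problem from producing an unbounded stream of restarts.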
