Distinguish failing CI runs from CI errors #328

shonfeder · 2024-06-26T02:09:35Z

A CI pipeline can fail because a build or test step returns a negative result, or because of some error in the CI pipeline logic. We currently don't distinguish these outcomes in most cases. Ignoring this distinction has the following known downsides:

1. Users see job failures, but don't learn that the CI is suffering from internal errors until they inspect the logs.
2. While we collect metrics on the number of failed CI jobs, we cannot differentiate these from errors that would indicate sporadic or pervasive failures in the infrastructure and services.
3. We have no way of restarting jobs that failed due to an infrastructure error after that error has been repaired.

(2) bit us last week, when we failed to detect ocaml/infrastructure#128 until it was so widespread that users were noticing the failures. If we had monitoring that alerted us to errors in the pipeline, we could have seen this coming much earlier. Incorporating an error status into the metrics sent to Grafana would make these failures clearly visible.

To address (1), we can send an error status to GitHub. The current API supports the following statuses:

error, failure, pending, success

But, iiuc, we don't use the error status in our reporting:

opam-repo-ci/service/github.ml

Lines 21 to 23 in 97d42b7

    
    | _, Ok m              -> Api.Status.v ~url `Success ~description:("Passed - "^m)
    | _, Error (`Active _) -> Api.Status.v ~url `Pending
    | _, Error (`Msg m)    -> Api.Status.v ~url `Failure ~description:("Failed - "^m)
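As a rough sketch of what using the error status could look like, assuming the job result gains an extra case for internal errors: the `gh_state` and `job_error` types, the `` `Internal `` constructor, and `status_of_result` are all hypothetical stand-ins, simplified so the snippet runs on its own. The real code would keep calling `Api.Status.v`.

```ocaml
(* The four statuses GitHub accepts, as listed above. *)
type gh_state = [ `Error | `Failure | `Pending | `Success ]

(* Simplified stand-in for the job error type.
   `Internal is a hypothetical new case for CI/infrastructure errors. *)
type job_error =
  [ `Active of [ `Ready | `Running ]
  | `Msg of string
  | `Internal of string ]

(* Map a job result to a GitHub state and an optional description. *)
let status_of_result : (string, job_error) result -> gh_state * string option =
  function
  | Ok m -> (`Success, Some ("Passed - " ^ m))
  | Error (`Active _) -> (`Pending, None)
  | Error (`Msg m) -> (`Failure, Some ("Failed - " ^ m))
  | Error (`Internal m) -> (`Error, Some ("Internal CI error - " ^ m))
```

The open question this leaves is where the `` `Internal `` case would come from, i.e. how the pipeline decides that a failure is ours rather than the package's.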

To address (2), we should send error results to Grafana:

opam-repo-ci/service/metrics.ml

Lines 34 to 37 in 97d42b7

    
    Gauge.set (master "not_started") (float_of_int n_per_status.not_started);
    Gauge.set (master "pending") (float_of_int n_per_status.pending);
    Gauge.set (master "failed") (float_of_int n_per_status.failed);
    Gauge.set (master "passed") (float_of_int n_per_status.passed);
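A minimal sketch of the extra metric, assuming the per-status counts get a new `errored` field. The record below is a hypothetical, self-contained copy of `n_per_status`, and `gauge_values` is an illustrative helper that only computes the label/value pairs; in the real module each pair would feed one `Gauge.set (master label) value` call like the ones above.

```ocaml
(* Hypothetical counts record: the fields used above, plus [errored]. *)
type n_per_status = {
  not_started : int;
  pending : int;
  failed : int;
  errored : int; (* assumed new field: jobs that hit an internal CI error *)
  passed : int;
}

(* Label/value pairs to report, one per gauge. *)
let gauge_values n =
  [ ("not_started", n.not_started);
    ("pending", n.pending);
    ("failed", n.failed);
    ("errored", n.errored);
    ("passed", n.passed) ]
  |> List.map (fun (label, count) -> (label, float_of_int count))
```

With a separate `errored` series, an alert in Grafana could trigger on internal errors alone instead of on the combined failure count.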

To address (3), we can start by recording errors in the job index, which doesn't currently differentiate failures from errors:

opam-repo-ci/lib/index.ml

Line 114 in 97d42b7

type build_status = [ `Not_started | `Pending | `Failed | `Passed ]
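One possible shape for the index change, as a sketch only: the `` `Errored `` constructor and the `needs_restart` and `restartable` helpers are hypothetical names, not something in the current code.

```ocaml
(* The existing statuses plus a hypothetical `Errored case. *)
type build_status = [ `Not_started | `Pending | `Failed | `Errored | `Passed ]

(* Only jobs that hit an internal error are worth re-running once the
   infrastructure is repaired; genuine failures should stay failed. *)
let needs_restart : build_status -> bool = function
  | `Errored -> true
  | `Not_started | `Pending | `Failed | `Passed -> false

(* Given (job id, status) pairs from the index, list the jobs to restart. *)
let restartable jobs =
  jobs
  |> List.filter (fun (_, status) -> needs_restart status)
  |> List.map fst
```

For example, `restartable [("a", `Failed); ("b", `Errored)]` picks out only `"b"`, which is the set we would want to re-trigger after an incident.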


hannesm · 2024-06-26T09:07:57Z

I propose to rename "error" to "internal error" or "internal CI error". And once this build status has been achieved, maybe an automatic restart of the job should be scheduled (surely with an exponential backoff to avoid CI going crazy)? Same could be done for "ocaml-ci".
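A rough sketch of the backoff idea, with made-up defaults: the function names, the 60 second base, the one hour cap, and the limit of 5 attempts are all assumptions for illustration, not values taken from either CI.

```ocaml
(* Delay in seconds before retry number [attempt] (0-based):
   doubles on each attempt and is capped at [cap]. *)
let backoff_delay ?(base = 60.0) ?(cap = 3600.0) attempt =
  Float.min cap (base *. (2.0 ** float_of_int attempt))

(* Stop retrying after [max_attempts], so a persistent outage
   does not lead to endless restarts. *)
let next_retry ?(max_attempts = 5) attempt =
  if attempt >= max_attempts then None else Some (backoff_delay attempt)
```

With these defaults the first retries would be scheduled after 60, 120, 240, 480 and 960 seconds, after which the job stays in the internal error state until someone restarts it by hand.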

shonfeder added the capacity+ label (Work that will increase our capacity, by reducing workload or increasing our efficacy) on Jun 26, 2024

shonfeder mentioned this issue Aug 28, 2024

opensuse builders have failures ocurrent/ocaml-ci#967

Closed
