Merge pull request #3410 from CliMA/ck/repro_tests2

Allow for flakiness in reproducibility tests

charleskawczynski authored Nov 7, 2024
2 parents 7b6bfab + 3cb5e84 commit aa735e9
Showing 13 changed files with 379 additions and 231 deletions.
4 changes: 4 additions & 0 deletions .buildkite/pipeline.yml
@@ -317,6 +317,10 @@ steps:
julia --color=yes --project=examples examples/hybrid/driver.jl
--config_file $CONFIG_PATH/sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res.yml
--job_id sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res
julia --color=yes --project=examples reproducibility_tests/test_mse.jl --test_broken_report_flakiness true
--job_id sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res
--out_dir sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res/output_active
artifact_paths: "sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res/output_active/*"
agents:
slurm_mem: 20GB
8 changes: 4 additions & 4 deletions examples/hybrid/driver.jl
@@ -148,15 +148,15 @@ if ClimaComms.iamroot(config.comms_ctx)
joinpath(
pkgdir(CA),
"reproducibility_tests",
"self_reference_or_path.jl",
"latest_comparable_paths.jl",
),
)
@info "Plotting"
path = self_reference_or_path() # __build__ path (not job path)
if path == :self_reference
paths = latest_comparable_paths() # __build__ path (not job path)
if isempty(paths)
make_plots(Val(Symbol(reference_job_id)), simulation.output_dir)
else
main_job_path = joinpath(path, reference_job_id)
main_job_path = joinpath(first(paths), reference_job_id)
nc_dir = joinpath(main_job_path, "nc_files")
if ispath(nc_dir)
@info "nc_dir exists"
52 changes: 38 additions & 14 deletions reproducibility_tests/README.md
@@ -18,14 +18,37 @@ Our solution to dealing with failure modes is by providing users with two workfl
- [Update mse tables](#How-to-update-mse-tables)

- A comparable reference dataset does **not** exist:
- Increment the reference counter in `reproducibility_tests/ref_counter.jl`. This triggers a "self-reference".
- Increment the reference counter in `reproducibility_tests/ref_counter.jl`.
  - [Update mse tables](#How-to-update-mse-tables), _setting all values to zero_

At this point, there are several important things to note:

- When a reference dataset does not exist, we still perform a reproducibility test so that we continuously exercise the testing infrastructure. However, we compare the solution dataset with itself (which we call a "self-reference"). Therefore, _all reproducibility tests for all jobs will pass_ (no matter what the results look like) when the reference counter is incremented. So, it is important to review the quality of the results when the reference counter is incremented.
- When a reference dataset does not exist, we still perform a reproducibility test so that we continuously exercise the testing infrastructure. However, we compare the solution dataset with itself. Therefore, _all reproducibility tests for all jobs will pass_ (no matter what the results look like) when the reference counter is incremented. So, it is important to review the quality of the results when the reference counter is incremented.

- Every time the reference counter is incremented, data from that PR is saved onto Caltech's central cluster. And that solution's dataset is the new reference dataset that all future PRs are compared against (until the reference counter is incremented again).
- When a PR passes CI on Buildkite while in the GitHub merge queue, or when a PR lands on the main branch, data from the HEAD commit of that PR is saved onto Caltech's central cluster. That solution's dataset becomes the new reference dataset that all future PRs are compared against (until the reference counter is incremented again). As a result, a PR will have some number of comparable references (possibly zero). For example, if we line up pull requests in the order that they are merged:

```
0186_73hasd ...
0187_73hasd # PR 1000 has 0 comparable references
0187_fgsae7 # PR 2309 has 1 comparable reference
0187_sdf63a # PR 1412 has 2 comparable references
0188_73hasd # PR 2359 has 0 comparable references
0189_sdf63a # PR 9346 has 0 comparable references
0189_73hasd # PR 3523 has 1 comparable reference
...
```

Note: we currently do not prefix the folder names with the reference counter; however, we will make this improvement soon.
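
As a side note, the comparable-reference counts in the example above can be derived mechanically. Below is a minimal sketch of that bookkeeping; the folder names are copied from the example, the counter-prefixed naming is the one shown there (not yet in place, per the note), and the `counter`/`n_comparable` helpers are hypothetical, not part of the repository:

```julia
# Hypothetical helper, assuming folders named "<counter>_<commit hash>",
# as in the example above.
folders = ["0187_73hasd", "0187_fgsae7", "0187_sdf63a", "0188_73hasd"]  # merge order

counter(name) = first(split(name, '_'))

# A dataset's comparable references are the earlier merged datasets that
# share its reference counter.
n_comparable(i) =
    count(j -> j < i && counter(folders[j]) == counter(folders[i]), eachindex(folders))

for (i, name) in enumerate(folders)
    println(name, " has ", n_comparable(i), " comparable reference(s)")
end
```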

## Allowing flaky tests

Users can pass the flag `--test_broken_report_flakiness` to the `test_mse.jl` script: `julia --project=examples reproducibility_tests/test_mse.jl --test_broken_report_flakiness true`. This has the following behavior (a conceptual sketch follows the list):

- If the test is not reproducible (i.e., flaky) when compared against `N` comparable references, then the test passes and is reported as broken.
- If the test is reproducible when compared against `N` comparable references, then `@test_broken` fails (an unexpected pass), and users are asked to fix the broken test, at which point the `--test_broken_report_flakiness true` flag can be removed from that particular job, reinforcing a strict reproducibility constraint.
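
A minimal sketch of how this flag could toggle the test semantics, using Julia's standard `Test` macros; the `reproducible` variable is a hypothetical placeholder for the outcome of the MSE comparison, and this is not the actual `test_mse.jl` implementation:

```julia
using Test

# Hypothetical placeholder: in practice this would be the result of comparing
# the job's computed MSEs against its comparable references.
reproducible = false
test_broken_report_flakiness = true  # value parsed from --test_broken_report_flakiness

@testset "reproducibility" begin
    if test_broken_report_flakiness
        # Flaky mode: a failure is recorded as Broken (the job still passes),
        # while an unexpected pass errors, prompting removal of the flag.
        @test_broken reproducible
    else
        # Strict mode: the job must reproduce the reference results.
        @test reproducible
    end
end
```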

## How to update mse tables

Expand Down Expand Up @@ -72,7 +95,7 @@ Reprodicibility tests are performed at the end of `examples/hybrid/driver.jl`, a
0) Run a simulation, with a particular `job_id`, to the final time.
1) Load a dictionary, `all_best_mse`, of previous "best" mean-squared errors from `mse_tables.jl` and extract the mean squared errors for the given `job_id` (store in job-specific dictionary, `best_mse`).
2) Export the solution (a `FieldVector`) at the final simulation time to an `NCDataset` file.
3) Compute the errors between the exported solution and the exported solution from the reference `NCDataset` file (which is saved in a dedicated folder on the Caltech Central cluster) and save into a dictionary, called `computed_mse`.
3) Compute the errors between the exported solution and the exported solution from the reference `NCDataset` files (which are saved in dedicated folders on the Caltech Central cluster) and save them into a dictionary called `computed_mse`.
4) Export this dictionary (`computed_mse`) to the output folder.
5) Test that `computed_mse` is no worse than `best_mse` (this determines whether the reproducibility test passes). A conceptual sketch of these steps follows.
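
A conceptual sketch of steps 1 and 3-5, with the table contents and computed errors replaced by hypothetical example values (the NetCDF export/import in steps 2-3 is only sketched in comments):

```julia
# Hypothetical example values; the real ones come from mse_tables.jl and
# from comparing NCDataset files.
all_best_mse = Dict("my_job" => Dict("var_1" => 1.0e-7, "var_2" => 1.0e-6))
computed_mse = Dict("var_1" => 5.0e-8, "var_2" => 2.0e-6)

best_mse = all_best_mse["my_job"]   # 1) extract this job's previous best errors
# 2) export the solution FieldVector to an NCDataset file (omitted)
# 3) compute `computed_mse` against the reference NCDataset files (values above)
# 4) export `computed_mse` to the output folder (omitted)
# 5) pass only if no error is worse than its recorded best
test_passes = all(computed_mse[k] <= best_mse[k] for k in keys(best_mse))
@show test_passes  # false here, since var_2 got worse in this made-up example
```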

@@ -89,28 +112,29 @@ To think about tracking which dataset to compare against, it's helpful to consid
Reference        hash of     hash of
counter          merged      reference
ref_counter.jl   commit      commit
1 => "V50XdC" => "V50XdC" # Self reference
1 => "V50XdC" => "V50XdC" # no comparable references
1 => "lBKsAn" => "V50XdC"
1 => "Eh2ToX" => "V50XdC"
2 => "bnMLxi" => "bnMLxi" # Self reference
2 => "bnMLxi" => "bnMLxi" # no comparable references
2 => "Jjx16f" => "bnMLxi"
3 => "dHkJqc" => "dHkJqc" # Self reference
3 => "dHkJqc" => "dHkJqc" # no comparable references
3 => "SIgf1i" => "dHkJqc"
3 => "vTsCoY" => "dHkJqc"
3 => "VvCzAH" => "dHkJqc"
```

The way this works is:

1) We start off with a self reference: print a new reference
1) We start off with no comparable references: print a new reference
counter in the `print new reference counter` job.

2) (PR author) copy-paste counter into `reproducibility_tests/ref_counter.jl`

3) Upon the next CI run, before performing the CI test,
we check if the counter indicates a self-reference by
checking if `reproducibility_tests/ref_counter.jl` in the PR
matches (e.g.,) `aRsVoY/ref_counter.jl` in the last
merged commit (on central). If yes, then it's a self
reference, if not, then we look-up the dataset based
on the counter.
we check whether the counter indicates the existence of comparable
references by checking whether `reproducibility_tests/ref_counter.jl`
in the PR matches (for example) `aRsVoY/ref_counter.jl` in the last
merged commit (on central). If there are comparable references,
we compare against them and require that they pass our
reproducibility tests; if not, we emit a warning to let users
know that they should visually verify the simulation results
(see the sketch below).
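
A minimal sketch of the counter check described in step 3; the file paths and the assumption that `ref_counter.jl` holds a bare integer are illustrative, not the actual CI implementation:

```julia
# Illustrative paths: the PR's counter file and the counter file stored with
# the most recently merged commit on the central cluster.
pr_counter_file      = "reproducibility_tests/ref_counter.jl"
central_counter_file = "/central/reproducibility_data/aRsVoY/ref_counter.jl"  # hypothetical path

# Assumes the file contains a bare integer counter.
read_counter(file) = parse(Int, strip(read(file, String)))

if isfile(pr_counter_file) && isfile(central_counter_file) &&
   read_counter(pr_counter_file) == read_counter(central_counter_file)
    @info "Comparable references exist; reproducibility tests must pass."
else
    @warn "No comparable references found; please visually verify the simulation results."
end
```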
128 changes: 69 additions & 59 deletions reproducibility_tests/compute_mse.jl
@@ -3,7 +3,7 @@ import NCDatasets
import Tar
import ClimaCoreTempestRemap as CCTR

include("self_reference_or_path.jl")
include("latest_comparable_paths.jl")

function get_nc_data(ds, var::String)
if haskey(ds, var)
@@ -61,86 +61,96 @@ function reproducibility_test(;
)
local ds_filename_reference
reference_keys = map(k -> varname(k), collect(keys(reference_mse)))
paths = String[] # initialize for later handling

if haskey(ENV, "BUILDKITE_COMMIT")
path = self_reference_or_path()
path == :self_reference && return reference_mse
ds_filename_reference = joinpath(path, ds_filename_computed)
paths = latest_comparable_paths(10)
isempty(paths) && return (reference_mse, paths)
@info "`ds_filename_computed`: `$ds_filename_computed`"
@info "`ds_filename_reference`: `$ds_filename_reference`"
job_dir = dirname(ds_filename_reference)
nc_tar = joinpath(job_dir, "nc_files.tar")
# We may have converted to tarball, try to
# extract nc files from tarball first:
if !isfile(ds_filename_reference)
if isfile(nc_tar)
mktempdir(joinpath(job_dir, tempdir())) do tdir
# We must extract to an empty folder, let's
# move it back to job_dir after.
Tar.extract(nc_tar, tdir) do hdr
basename(hdr.path) == basename(ds_filename_reference)
ds_filename_references =
map(p -> joinpath(p, ds_filename_computed), paths)
for ds_filename_reference in ds_filename_references
@info "`ds_filename_reference`: `$ds_filename_reference`"
job_dir = dirname(ds_filename_reference)
nc_tar = joinpath(job_dir, "nc_files.tar")
# We may have converted to tarball, try to
# extract nc files from tarball first:
if !isfile(ds_filename_reference)
if isfile(nc_tar)
mktempdir(joinpath(job_dir, tempdir())) do tdir
# We must extract to an empty folder, let's
# move it back to job_dir after.
Tar.extract(nc_tar, tdir) do hdr
basename(hdr.path) ==
basename(ds_filename_reference)
end
mv(
joinpath(tdir, basename(ds_filename_reference)),
joinpath(job_dir, basename(ds_filename_reference));
force = true,
)
end
mv(
joinpath(tdir, basename(ds_filename_reference)),
joinpath(job_dir, basename(ds_filename_reference));
force = true,
)
else
@warn "There is no reference dataset, and no NC tar file."
end
else
@warn "There is no reference dataset, and no NC tar file."
end
end
if !isfile(ds_filename_reference)
msg = "\n\n"
msg *= "Pull request author:\n"
msg *= " It seems that a new dataset,\n"
msg *= "\n"
msg *= "dataset file:`$(ds_filename_computed)`,"
msg *= "\n"
msg *= " was created, or the name of the dataset\n"
msg *= " has changed. Please increment the reference\n"
msg *= " counter in `reproducibility_tests/ref_counter.jl`.\n"
msg *= "\n"
msg *= " If this is not the case, then please\n"
msg *= " open an issue with a link pointing to this\n"
msg *= " PR and build.\n"
msg *= "\n"
msg *= "For more information, please find\n"
msg *= "`reproducibility_tests/README.md` and read the section\n\n"
msg *= " `How to merge pull requests (PR) that get approved\n"
msg *= " but *break* reproducibility tests`\n\n"
msg *= "for how to merge this PR."
error(msg)
if !isfile(ds_filename_reference)
msg = "\n\n"
msg *= "Pull request author:\n"
msg *= " It seems that a new dataset,\n"
msg *= "\n"
msg *= "dataset file:`$(ds_filename_computed)`,"
msg *= "\n"
msg *= " was created, or the name of the dataset\n"
msg *= " has changed. Please increment the reference\n"
msg *= " counter in `reproducibility_tests/ref_counter.jl`.\n"
msg *= "\n"
msg *= " If this is not the case, then please\n"
msg *= " open an issue with a link pointing to this\n"
msg *= " PR and build.\n"
msg *= "\n"
msg *= "For more information, please find\n"
msg *= "`reproducibility_tests/README.md` and read the section\n\n"
msg *= " `How to merge pull requests (PR) that get approved\n"
msg *= " but *break* reproducibility tests`\n\n"
msg *= "for how to merge this PR."
error(msg)
end
end
else
@warn "Buildkite not detected. Skipping reproducibility tests."
@info "Please review output results before merging."
return reference_mse
return (reference_mse, paths)
end

local computed_mse
@info "Prescribed reference keys $reference_keys"
dict_computed = to_dict(ds_filename_computed, reference_keys)
dict_reference = to_dict(ds_filename_reference, reference_keys)
dict_references =
map(ds -> to_dict(ds, reference_keys), ds_filename_references)
@info "Computed keys $(collect(keys(dict_computed)))"
@info "Reference keys $(collect(keys(dict_reference)))"
try
computed_mse = CRT.compute_mse(;
job_name = string(job_id),
reference_keys = reference_keys,
dict_computed,
dict_reference,
)
catch err
@info "Reference keys $(collect(keys(first(dict_references))))"
if all(dr -> keys(dict_computed) == keys(dr), dict_references) && all(
dr -> typeof(values(dict_computed)) == typeof(values(dr)),
dict_references,
)
computed_mses = map(dict_references) do dict_reference
CRT.compute_mse(;
job_name = string(job_id),
reference_keys = reference_keys,
dict_computed,
dict_reference,
)
end
else
msg = ""
msg *= "The reproducibility test broke. Please find\n"
msg *= "`reproducibility_tests/README.md` and read the section\n\n"
msg *= " `How to merge pull requests (PR) that get approved but *break* reproducibility tests`\n\n"
msg *= "for how to merge this PR."
@info msg
rethrow(err)
error(msg)
end
return computed_mse
return (computed_mses, paths)

end

