Merge pull request #3410 from CliMA/ck/repro_tests2

Allow for flakiness in reproducibility tests

charleskawczynski authored Nov 7, 2024
2 parents 7b6bfab + 3cb5e84 commit aa735e9
Showing 13 changed files with 379 additions and 231 deletions.
4 changes: 4 additions & 0 deletions .buildkite/pipeline.yml
@@ -317,6 +317,10 @@ steps:
julia --color=yes --project=examples examples/hybrid/driver.jl
--config_file $CONFIG_PATH/sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res.yml
--job_id sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res
julia --color=yes --project=examples reproducibility_tests/test_mse.jl --test_broken_report_flakiness true
--job_id sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res
--out_dir sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res/output_active
artifact_paths: "sphere_aquaplanet_rhoe_equilmoist_allsky_gw_res/output_active/*"
agents:
slurm_mem: 20GB
8 changes: 4 additions & 4 deletions examples/hybrid/driver.jl
@@ -148,15 +148,15 @@ if ClimaComms.iamroot(config.comms_ctx)
joinpath(
pkgdir(CA),
"reproducibility_tests",
"self_reference_or_path.jl",
"latest_comparable_paths.jl",
),
)
@info "Plotting"
path = self_reference_or_path() # __build__ path (not job path)
if path == :self_reference
paths = latest_comparable_paths() # __build__ path (not job path)
if isempty(paths)
make_plots(Val(Symbol(reference_job_id)), simulation.output_dir)
else
main_job_path = joinpath(path, reference_job_id)
main_job_path = joinpath(first(paths), reference_job_id)
nc_dir = joinpath(main_job_path, "nc_files")
if ispath(nc_dir)
@info "nc_dir exists"
52 changes: 38 additions & 14 deletions reproducibility_tests/README.md
@@ -18,14 +18,37 @@ Our solution to dealing with failure modes is by providing users with two workfl
- [Update mse tables](#How-to-update-mse-tables)

- A comparable reference dataset does **not** exist:
- Increment the reference counter in `reproducibility_tests/ref_counter.jl`. This triggers a "self-reference".
- Increment the reference counter in `reproducibility_tests/ref_counter.jl`.
  - [Update mse tables](#How-to-update-mse-tables), _setting all values to zero_

At this point, there are several important things to note:

- When a reference dataset does not exist, we still perform a reproducibility test so that we continuously exercise the testing infrastructure. However, we compare the solution dataset with itself (which we call a "self-reference"). Therefore, _all reproducibility tests for all jobs will pass_ (no matter what the results look like) when the reference counter is incremented. So, it is important to review the quality of the results when the reference counter is incremented.
- When a reference dataset does not exist, we still perform a reproducibility test so that we continuously exercise the testing infrastructure. However, we compare the solution dataset with itself. Therefore, _all reproducibility tests for all jobs will pass_ (no matter what the results look like) when the reference counter is incremented. So, it is important to review the quality of the results when the reference counter is incremented.

- Every time the reference counter is incremented, data from that PR is saved onto Caltech's central cluster. And that solution's dataset is the new reference dataset that all future PRs are compared against (until the reference counter is incremented again).
- When a PR passes CI on Buildkite while in the GitHub merge queue, or when a PR lands on the main branch, data from the HEAD commit of that PR is saved onto Caltech's central cluster. That solution's dataset becomes the new reference dataset that all future PRs are compared against (until the reference counter is incremented again). As a result, a PR will have some number of comparable references (possibly zero). For example, if we line up pull requests in the order that they are merged:

```
0186_73hasd ...
0187_73hasd # PR 1000 has 0 comparable references
0187_fgsae7 # PR 2309 has 1 comparable reference
0187_sdf63a # PR 1412 has 2 comparable references
0188_73hasd # PR 2359 has 0 comparable references
0189_sdf63a # PR 9346 has 0 comparable references
0189_73hasd # PR 3523 has 1 comparable reference
...
```

Note: we currently do not prefix the folder names with the reference counter; however, we will make this improvement soon.
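
As a side note, the comparable-reference counts in the example above can be derived mechanically. Below is a minimal sketch of that bookkeeping; the folder names are copied from the example, the counter-prefixed naming is the one shown there (not yet in place, per the note), and the `counter`/`n_comparable` helpers are hypothetical, not part of the repository:

```julia
# Hypothetical helper, assuming folders named "<counter>_<commit hash>",
# as in the example above.
folders = ["0187_73hasd", "0187_fgsae7", "0187_sdf63a", "0188_73hasd"]  # merge order

counter(name) = first(split(name, '_'))

# A dataset's comparable references are the earlier merged datasets that
# share its reference counter.
n_comparable(i) =
    count(j -> j < i && counter(folders[j]) == counter(folders[i]), eachindex(folders))

for (i, name) in enumerate(folders)
    println(name, " has ", n_comparable(i), " comparable reference(s)")
end
```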

## Allowing flaky tests

Users can pass the flag `--test_broken_report_flakiness` to the `test_mse.jl` script: `julia --project=examples reproducibility_tests/test_mse.jl --test_broken_report_flakiness true`. This has the following behavior (a conceptual sketch follows the list):

- If the test is not reproducible (i.e., flaky) when compared against `N` comparable references, then the test passes and is reported as broken.
- If the test is reproducible when compared against `N` comparable references, then `@test_broken` fails (an unexpected pass), and users are asked to fix the broken test, at which point the `--test_broken_report_flakiness true` flag can be removed from that particular job, reinforcing a strict reproducibility constraint.
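
A minimal sketch of how this flag could toggle the test semantics, using Julia's standard `Test` macros; the `reproducible` variable is a hypothetical placeholder for the outcome of the MSE comparison, and this is not the actual `test_mse.jl` implementation:

```julia
using Test

# Hypothetical placeholder: in practice this would be the result of comparing
# the job's computed MSEs against its comparable references.
reproducible = false
test_broken_report_flakiness = true  # value parsed from --test_broken_report_flakiness

@testset "reproducibility" begin
    if test_broken_report_flakiness
        # Flaky mode: a failure is recorded as Broken (the job still passes),
        # while an unexpected pass errors, prompting removal of the flag.
        @test_broken reproducible
    else
        # Strict mode: the job must reproduce the reference results.
        @test reproducible
    end
end
```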

## How to update mse tables

Expand Down Expand Up @@ -72,7 +95,7 @@ Reprodicibility tests are performed at the end of `examples/hybrid/driver.jl`, a
0) Run a simulation, with a particular `job_id`, to the final time.
1) Load a dictionary, `all_best_mse`, of previous "best" mean-squared errors from `mse_tables.jl` and extract the mean squared errors for the given `job_id` (store in job-specific dictionary, `best_mse`).
2) Export the solution (a `FieldVector`) at the final simulation time to an `NCDataset` file.
3) Compute the errors between the exported solution and the exported solution from the reference `NCDataset` file (which is saved in a dedicated folder on the Caltech Central cluster) and save into a dictionary, called `computed_mse`.
3) Compute the errors between the exported solution and the exported solution from the reference `NCDataset` files (which are saved in dedicated folders on the Caltech Central cluster) and save them into a dictionary called `computed_mse`.
4) Export this dictionary (`computed_mse`) to the output folder.
5) Test that `computed_mse` is no worse than `best_mse` (this determines whether the reproducibility test passes). A conceptual sketch of these steps follows.
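
A conceptual sketch of steps 1 and 3-5, with the table contents and computed errors replaced by hypothetical example values (the NetCDF export/import in steps 2-3 is only sketched in comments):

```julia
# Hypothetical example values; the real ones come from mse_tables.jl and
# from comparing NCDataset files.
all_best_mse = Dict("my_job" => Dict("var_1" => 1.0e-7, "var_2" => 1.0e-6))
computed_mse = Dict("var_1" => 5.0e-8, "var_2" => 2.0e-6)

best_mse = all_best_mse["my_job"]   # 1) extract this job's previous best errors
# 2) export the solution FieldVector to an NCDataset file (omitted)
# 3) compute `computed_mse` against the reference NCDataset files (values above)
# 4) export `computed_mse` to the output folder (omitted)
# 5) pass only if no error is worse than its recorded best
test_passes = all(computed_mse[k] <= best_mse[k] for k in keys(best_mse))
@show test_passes  # false here, since var_2 got worse in this made-up example
```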

@@ -89,28 +112,29 @@ To think about tracking which dataset to compare against, it's helpful to consid
Reference        hash of     hash of
counter          merged      reference
ref_counter.jl   commit      commit
1 => "V50XdC" => "V50XdC" # Self reference
1 => "V50XdC" => "V50XdC" # no comparable references
1 => "lBKsAn" => "V50XdC"
1 => "Eh2ToX" => "V50XdC"
2 => "bnMLxi" => "bnMLxi" # Self reference
2 => "bnMLxi" => "bnMLxi" # no comparable references
2 => "Jjx16f" => "bnMLxi"
3 => "dHkJqc" => "dHkJqc" # Self reference
3 => "dHkJqc" => "dHkJqc" # no comparable references
3 => "SIgf1i" => "dHkJqc"
3 => "vTsCoY" => "dHkJqc"
3 => "VvCzAH" => "dHkJqc"
```

The way this works is:

1) We start off with a self reference: print a new reference
1) We start off with no comparable references: print a new reference
counter in the `print new reference counter` job.

2) (PR author) copy-paste counter into `reproducibility_tests/ref_counter.jl`

3) Upon the next CI run, before performing the CI test,
we check if the counter indicates a self-reference by
checking if `reproducibility_tests/ref_counter.jl` in the PR
matches (e.g.,) `aRsVoY/ref_counter.jl` in the last
merged commit (on central). If yes, then it's a self
reference, if not, then we look-up the dataset based
on the counter.
we check whether the counter indicates the existence of comparable
references by checking whether `reproducibility_tests/ref_counter.jl`
in the PR matches (for example) `aRsVoY/ref_counter.jl` in the last
merged commit (on central). If there are comparable references,
we compare against them and require that they pass our
reproducibility tests; if not, we emit a warning to let users
know that they should visually verify the simulation results
(see the sketch below).
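
A minimal sketch of the counter check described in step 3; the file paths and the assumption that `ref_counter.jl` holds a bare integer are illustrative, not the actual CI implementation:

```julia
# Illustrative paths: the PR's counter file and the counter file stored with
# the most recently merged commit on the central cluster.
pr_counter_file      = "reproducibility_tests/ref_counter.jl"
central_counter_file = "/central/reproducibility_data/aRsVoY/ref_counter.jl"  # hypothetical path

# Assumes the file contains a bare integer counter.
read_counter(file) = parse(Int, strip(read(file, String)))

if isfile(pr_counter_file) && isfile(central_counter_file) &&
   read_counter(pr_counter_file) == read_counter(central_counter_file)
    @info "Comparable references exist; reproducibility tests must pass."
else
    @warn "No comparable references found; please visually verify the simulation results."
end
```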
128 changes: 69 additions & 59 deletions reproducibility_tests/compute_mse.jl
@@ -3,7 +3,7 @@ import NCDatasets
import Tar
import ClimaCoreTempestRemap as CCTR

include("self_reference_or_path.jl")
include("latest_comparable_paths.jl")

function get_nc_data(ds, var::String)
if haskey(ds, var)
@@ -61,86 +61,96 @@ function reproducibility_test(;
)
local ds_filename_reference
reference_keys = map(k -> varname(k), collect(keys(reference_mse)))
paths = String[] # initialize for later handling

if haskey(ENV, "BUILDKITE_COMMIT")
path = self_reference_or_path()
path == :self_reference && return reference_mse
ds_filename_reference = joinpath(path, ds_filename_computed)
paths = latest_comparable_paths(10)
isempty(paths) && return (reference_mse, paths)
@info "`ds_filename_computed`: `$ds_filename_computed`"
@info "`ds_filename_reference`: `$ds_filename_reference`"
job_dir = dirname(ds_filename_reference)
nc_tar = joinpath(job_dir, "nc_files.tar")
# We may have converted to tarball, try to
# extract nc files from tarball first:
if !isfile(ds_filename_reference)
if isfile(nc_tar)
mktempdir(joinpath(job_dir, tempdir())) do tdir
# We must extract to an empty folder, let's
# move it back to job_dir after.
Tar.extract(nc_tar, tdir) do hdr
basename(hdr.path) == basename(ds_filename_reference)
ds_filename_references =
map(p -> joinpath(p, ds_filename_computed), paths)
for ds_filename_reference in ds_filename_references
@info "`ds_filename_reference`: `$ds_filename_reference`"
job_dir = dirname(ds_filename_reference)
nc_tar = joinpath(job_dir, "nc_files.tar")
# We may have converted to tarball, try to
# extract nc files from tarball first:
if !isfile(ds_filename_reference)
if isfile(nc_tar)
mktempdir(joinpath(job_dir, tempdir())) do tdir
# We must extract to an empty folder, let's
# move it back to job_dir after.
Tar.extract(nc_tar, tdir) do hdr
basename(hdr.path) ==
basename(ds_filename_reference)
end
mv(
joinpath(tdir, basename(ds_filename_reference)),
joinpath(job_dir, basename(ds_filename_reference));
force = true,
)
end
mv(
joinpath(tdir, basename(ds_filename_reference)),
joinpath(job_dir, basename(ds_filename_reference));
force = true,
)
else
@warn "There is no reference dataset, and no NC tar file."
end
else
@warn "There is no reference dataset, and no NC tar file."
end
end
if !isfile(ds_filename_reference)
msg = "\n\n"
msg *= "Pull request author:\n"
msg *= " It seems that a new dataset,\n"
msg *= "\n"
msg *= "dataset file:`$(ds_filename_computed)`,"
msg *= "\n"
msg *= " was created, or the name of the dataset\n"
msg *= " has changed. Please increment the reference\n"
msg *= " counter in `reproducibility_tests/ref_counter.jl`.\n"
msg *= "\n"
msg *= " If this is not the case, then please\n"
msg *= " open an issue with a link pointing to this\n"
msg *= " PR and build.\n"
msg *= "\n"
msg *= "For more information, please find\n"
msg *= "`reproducibility_tests/README.md` and read the section\n\n"
msg *= " `How to merge pull requests (PR) that get approved\n"
msg *= " but *break* reproducibility tests`\n\n"
msg *= "for how to merge this PR."
error(msg)
if !isfile(ds_filename_reference)
msg = "\n\n"
msg *= "Pull request author:\n"
msg *= " It seems that a new dataset,\n"
msg *= "\n"
msg *= "dataset file:`$(ds_filename_computed)`,"
msg *= "\n"
msg *= " was created, or the name of the dataset\n"
msg *= " has changed. Please increment the reference\n"
msg *= " counter in `reproducibility_tests/ref_counter.jl`.\n"
msg *= "\n"
msg *= " If this is not the case, then please\n"
msg *= " open an issue with a link pointing to this\n"
msg *= " PR and build.\n"
msg *= "\n"
msg *= "For more information, please find\n"
msg *= "`reproducibility_tests/README.md` and read the section\n\n"
msg *= " `How to merge pull requests (PR) that get approved\n"
msg *= " but *break* reproducibility tests`\n\n"
msg *= "for how to merge this PR."
error(msg)
end
end
else
@warn "Buildkite not detected. Skipping reproducibility tests."
@info "Please review output results before merging."
return reference_mse
return (reference_mse, paths)
end

local computed_mse
@info "Prescribed reference keys $reference_keys"
dict_computed = to_dict(ds_filename_computed, reference_keys)
dict_reference = to_dict(ds_filename_reference, reference_keys)
dict_references =
map(ds -> to_dict(ds, reference_keys), ds_filename_references)
@info "Computed keys $(collect(keys(dict_computed)))"
@info "Reference keys $(collect(keys(dict_reference)))"
try
computed_mse = CRT.compute_mse(;
job_name = string(job_id),
reference_keys = reference_keys,
dict_computed,
dict_reference,
)
catch err
@info "Reference keys $(collect(keys(first(dict_references))))"
if all(dr -> keys(dict_computed) == keys(dr), dict_references) && all(
dr -> typeof(values(dict_computed)) == typeof(values(dr)),
dict_references,
)
computed_mses = map(dict_references) do dict_reference
CRT.compute_mse(;
job_name = string(job_id),
reference_keys = reference_keys,
dict_computed,
dict_reference,
)
end
else
msg = ""
msg *= "The reproducibility test broke. Please find\n"
msg *= "`reproducibility_tests/README.md` and read the section\n\n"
msg *= " `How to merge pull requests (PR) that get approved but *break* reproducibility tests`\n\n"
msg *= "for how to merge this PR."
@info msg
rethrow(err)
error(msg)
end
return computed_mse
return (computed_mses, paths)

end

