Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: inflation creating out-of-bounds values qceff #749

Open
hkershaw-brown opened this issue Oct 3, 2024 · 15 comments · May be fixed by #794
Open

bug: inflation creating out-of-bounds values qceff #749

hkershaw-brown opened this issue Oct 3, 2024 · 15 comments · May be fixed by #794
Labels
blocked waiting on something else Bug Something isn't working Critical Affects critical functionality or data QCEFF quantile conserving filters

Comments

@hkershaw-brown
Copy link
Member

hkershaw-brown commented Oct 3, 2024

🐛 Your bug may already be reported!
Please search on the issue tracker before creating a new issue.

Describe the bug

  1. List the steps someone needs to take to reproduce the bug.
    Run CAM-FV reanalysis single run of filter example /glade/derecho/scratch/hkershaw/DART/CAM-out-of-bounds/Rean_run
    inf_flavor 2

  2. What was the expected outcome?
    Run to completion

  3. What actually happened?

filter errors out:

ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf
  message:  Smallest ensemble member less than lower bound -4.009123520117420E-164  0.000000000000000E+000

Error Message

Please provide any error messages.

Which model(s) are you working with?

CAM-FV

Screenshots

Here are the lines of code:

sd_inflate = sqrt(inflate)
ens = ens * sd_inflate + mean * (1.0_r8 - sd_inflate)

Version of DART

Which version of DART are you using?
v11.8.1

Have you modified the DART code?

No

Here is a small reproducer

program test_maths

! Using selected_real_kind for double precision
integer, parameter :: dp = selected_real_kind(15, 307)
real(kind=dp) :: ens
real(kind=dp) :: mean
real(kind=dp) :: inflate  
real(kind=dp) :: sd_inflate

ens = 3.2342393548165711e-191_dp
inflate = 1.4082289753147326_dp
mean = 2.1474965713154784e-163_dp

print *, ens    

sd_inflate = sqrt(inflate) 
ens = ens * sd_inflate + mean * (1.0_dp - sd_inflate)

print *, ens    

end program test_maths  
hkershaw@derecho3:/glade/derecho/scratch/hkershaw/DART/Bugs/qceff-maths$  ifort fortran_maths.f90 
hkershaw@derecho3:/glade/derecho/scratch/hkershaw/DART/Bugs/qceff-maths$ ./a.out 
  3.234239354816571E-191
 -4.009123520117420E-164

Or you can run CAM-FV filter
/glade/derecho/scratch/hkershaw/DART/CAM-out-of-bounds/Rean_run

The numbers for the test_maths.f90 program are from proc3 (out of 128) j==61138

Build information

Please describe:

  1. Derecho
  2. intel

also mac gfortran GNU Fortran (MacPorts gcc13 13.2.0_4+stdlib_flag) 13.2.0

[hkershaw:problem]() > gfortran fortran_maths.f90
[hkershaw:problem]() > ./a.out 
   3.2342393548165711E-191
  -4.0091235201174199E-164
@hkershaw-brown hkershaw-brown added the QCEFF quantile conserving filters label Oct 3, 2024
@hkershaw-brown
Copy link
Member Author

there are some early returns in probit_transform_mod

! Do not know what to do if sd of original ensemble is 0 (or small, work on this later)
if(get_bnrh_sd(p) <= 0.0_r8) then
! Just return the original ensemble
probit_ens = state_ens

@hkershaw-brown
Copy link
Member Author

I think this may be a reproducer for at least part of #681 and also why #709

@hkershaw-brown
Copy link
Member Author

whole ensemble:

before line 559

"i","value"
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0
10,0
11,0
12,0
13,0
14,0
15,0
16,0
17,0
18,0
19,0
20,0
21,0
22,0
23,0
24,0
25,0
26,0
27,0
28,0
29,0
30,0
31,0
32,0
33,0
34,0
35,0
36,0
37,0
38,0
39,0
40,0
41,0
42,0
43,0
44,0
45,0
46,0
47,3.2342393548165711e-191
48,0
49,0
50,0
51,0
52,0
53,0
54,0
55,0
56,0
57,0
58,0
59,0
60,0
61,0
62,0
63,0
64,0
65,1.5319740054648927e-206
66,1.3760414660445314e-201
67,0
68,0
69,0
70,1.0275904858792872e-161
71,0
72,0
73,0
74,0
75,0
76,0
77,0
78,0
79,6.9040677117309552e-162
80,0

after inflation

"i","value"
1,-4.0091235201174199e-164
2,-4.0091235201174199e-164
3,-4.0091235201174199e-164
4,-4.0091235201174199e-164
5,-4.0091235201174199e-164
6,-4.0091235201174199e-164
7,-4.0091235201174199e-164
8,-4.0091235201174199e-164
9,-4.0091235201174199e-164
10,-4.0091235201174199e-164
11,-4.0091235201174199e-164
12,-4.0091235201174199e-164
13,-4.0091235201174199e-164
14,-4.0091235201174199e-164
15,-4.0091235201174199e-164
16,-4.0091235201174199e-164
17,-4.0091235201174199e-164
18,-4.0091235201174199e-164
19,-4.0091235201174199e-164
20,-4.0091235201174199e-164
21,-4.0091235201174199e-164
22,-4.0091235201174199e-164
23,-4.0091235201174199e-164
24,-4.0091235201174199e-164
25,-4.0091235201174199e-164
26,-4.0091235201174199e-164
27,-4.0091235201174199e-164
28,-4.0091235201174199e-164
29,-4.0091235201174199e-164
30,-4.0091235201174199e-164
31,-4.0091235201174199e-164
32,-4.0091235201174199e-164
33,-4.0091235201174199e-164
34,-4.0091235201174199e-164
35,-4.0091235201174199e-164
36,-4.0091235201174199e-164
37,-4.0091235201174199e-164
38,-4.0091235201174199e-164
39,-4.0091235201174199e-164
40,-4.0091235201174199e-164
41,-4.0091235201174199e-164
42,-4.0091235201174199e-164
43,-4.0091235201174199e-164
44,-4.0091235201174199e-164
45,-4.0091235201174199e-164
46,-4.0091235201174199e-164
47,-4.0091235201174199e-164
48,-4.0091235201174199e-164
49,-4.0091235201174199e-164
50,-4.0091235201174199e-164
51,-4.0091235201174199e-164
52,-4.0091235201174199e-164
53,-4.0091235201174199e-164
54,-4.0091235201174199e-164
55,-4.0091235201174199e-164
56,-4.0091235201174199e-164
57,-4.0091235201174199e-164
58,-4.0091235201174199e-164
59,-4.0091235201174199e-164
60,-4.0091235201174199e-164
61,-4.0091235201174199e-164
62,-4.0091235201174199e-164
63,-4.0091235201174199e-164
64,-4.0091235201174199e-164
65,-4.0091235201174199e-164
66,-4.0091235201174199e-164
67,-4.0091235201174199e-164
68,-4.0091235201174199e-164
69,-4.0091235201174199e-164
70,1.215420420032912e-161
71,-4.0091235201174199e-164
72,-4.0091235201174199e-164
73,-4.0091235201174199e-164
74,-4.0091235201174199e-164
75,-4.0091235201174199e-164
76,-4.0091235201174199e-164
77,-4.0091235201174199e-164
78,-4.0091235201174199e-164
79,8.1528847158862943e-162
80,-4.0091235201174199e-164

@hkershaw-brown
Copy link
Member Author

from_probit_bounded_normal_rh state_ens gets set to probit_ens

Screenshot 2024-10-03 at 2 16 19 PM Screenshot 2024-10-03 at 2 16 33 PM

not sure why the more_params(7) tail_sd_eft is negative? get_bnrh_sd(p)

! Get ensemble mean and sd
call normal_mean_sd(x, ens_size, mean, tail_sd_left)
tail_sd_right = tail_sd_left
! Don't know what to do if sd is 0; tail_sds returned are illegal value to indicate this
if(tail_sd_left <= 0.0_r8) then
tail_sd_left = -99_r8
tail_sd_right = -99_r8
return
endif

@hkershaw-brown
Copy link
Member Author

The sd == 0 so you never transform into (or back out of) probit space.

Screenshot 2024-10-03 at 3 29 19 PM

Inflation pushes the
I think in this case
ens = (ens - mean) * sqrt(inflate) + mean
if your ens(i) == mean then you can not inflate, but this is only going to work for multiplicative inflation. If you're adding noise or whatever various inflation methods do, I think you will have to enforce the bounds. Is this ok? clamping rather than qceff enforced bounds.

 555       ! Spread the ensemble out linearly for deterministic
 556       sd_inflate = sqrt(inflate)
 557       ! HK if ens == mean do nothing
 558       where ((ens - mean) > epsilon(mean) )
 559          ! Following line can lead to inflation of 1.0 changing ens on some compilers
 560          !!! ens = (ens - mean) * sqrt(inflate) + mean
 561          ! Following gives 1.0 inflation having no impact on known compilers
 562          ens = ens * sd_inflate + mean * (1.0_r8 - sd_inflate)
 563       endwhere

@hkershaw-brown
Copy link
Member Author

note there is issue #748 which is a different inflation QCEFF bug which looks like it may create out-of-bounds values also.

@jlaucar
Copy link
Contributor

jlaucar commented Dec 4, 2024

This bug impacted the application of inflation, but it was potentially much more insidious. As noted by Helen, the problem occured when the bnrh transform was unable to complete because the ensemble standard deviation was not positive. This same problem can arise for all of the other places in assim_tools_mod.f90 where transforms are made. The most straightforward solution begins by having an error return from the probit transform. Only the bnrh can fail so all other types of distributions always complete and do not need to be considered. The original code design assumed that if the standard deviation was not positive, that all ensemble members must have the same value. However, as noted in Helen's examples, if the differences between ensemble members are very small, computational precision limits can result in a computed standard deviation that is not positive. The assumption for the inflation and the other parts of assim_tools was that any distribution with all members identical could not be changed so it didn't matter if the failure of the transform was just ignored.

Note that the numerical analysis is very complex for the case where the computed SD for the bnrh transform is very small but non-zero. A number of tests suggest that this does not have inappropriate results, but I would not be shocked if something bad could happen in unusual cases.

@jlaucar
Copy link
Contributor

jlaucar commented Dec 23, 2024

Moha and Jeff did a code review of the proposed bug fix which led to one additional clarification to the code. The Lorenz_06_tracer tests seem to be okay. This version has been pushed and is now ready for testing with the CAM reanalysis code by Kevin.

edit: change is 1f2b7e1

@hkershaw-brown
Copy link
Member Author

note for performance later, to_probit_bounded_normal_rh returns the original ensemble if the transform fails:

! Fail if sd is not positive (or small, work on this later)
if(get_bnrh_sd(p) <= 0.0_r8) then
ierr = 1
! Just return the original ensemble
probit_ens = state_ens
! Free up the storage that would have been used for the transformed variables
call deallocate_distribution_params(p)
return

check on ierr, and set again here: not needed?

do i = 1, num_vars
call transform_to_probit(ens_size, state_ens(1:ens_size, i), distribution_type(i), &
p(i), temp_ens, use_input_p, bounded_below, bounded_above, lower_bound, upper_bound, ierr(i))
if(ierr(i) == 0)
probit_ens(1:ens_size, i) = temp_ens
else
! If transform failed, return the input state
probit_ens(1:ens_size, i) = state_ens(1:ens_size, i)
endif
end do

@kdraeder
Copy link
Contributor

I've tidied my local branch, in which I set up the previous BNRH tests.
Then I set it to track the qceff_inflation_fix branch by fetching the _fix branch and rebasing to it.
Lots of assimilation_code/modules/assimilation files were updated,
so I'm assuming that they are the new ones. I'll check, if anyone thinks that's necessary.

Then we need to define the test.
I could continue the previous test because all of the restarts, etc. (2020-02-01-00000)
seem to be in the RUNDIR, but with 2 changes; rebuilding filter and turning off the fix_bound_violations.
The alternative is to set up a new case and start it from the beginning, where we first saw the negative values.
This would also have fix_bound_violations turned off.
I haven't checked the ease of access to those restart files.

Are there other things to check?
Diagnostics to turn on ?

@jlaucar
Copy link
Contributor

jlaucar commented Dec 30, 2024 via email

@kdraeder
Copy link
Contributor

My notes say that it failed during the first assimilation for ~26,000 variables while processing
almost a million obs x number_close_variables.
That was started from a Reanalysis restart set, so it's well spun up.
So it seems to me highly likely that the failures would have continued.
I didn't test that because we were trying to get results for an imminent talk.

I could run one cycle of the existing case with the old filter and the namelist fix turned off.
If it fails, then we have a definitive test case. If not, I'll set up a new case to start at the beginning.

@kdraeder
Copy link
Contributor

Using the old filter with the namelist negative fixer turned off led to a failure somewhere in the first 10,000 obs:
Before observation assimilation TIME: 2024/12/31 10:23:22
ERROR FROM:
source : bnrh_distribution_mod.f90
routine: bnrh_cdf
message: Smallest ensemble member less than lower bound -1.589402457140229E-156 0.000000000000000E+000

Replacing filter with one compiled from the latest qceff_inflation_fix branch resulted in a successful assimilation, which used the same set of state input files as the failed assimilation.
Before observation assimilation TIME: 2024/10/11 16:12:42
PE 0: comp_cov_factor: Standard Gaspari Cohn localization selected
Processing observation 10000 of 1059656 TIME: 2024/10/11 16:12:44
... 1040000 more with no errors.

@kdraeder
Copy link
Contributor

kdraeder commented Jan 6, 2025

I set up a new experiment to start from the same date as the assimilation which successfully used
fix_bounds_violations (2019-12-30-00000), but turned off fix_b... and used the qceff_inflation_fix code.
I ran it for 8 cycles and compared the obs space output of the 2 cases.
The printed values of the RMSE and bias "grand" statistics are identical, up to the 5 digits printed,
for all obs types, levels, and regions. That's 873 pairs.
I've also looked at several dozen pictures (RAD T, GPS, SAT U, % of obs used, all 3 regions) to confirm.

@kdraeder
Copy link
Contributor

kdraeder commented Jan 6, 2025

@hkershaw-brown wrote in email:

So I'm on the same page here, this test is comparing 2 versions of a given code (Kevin's ~DART/Manhattan_git)

KDM = Kevin's ~DART/Manhattan_git code pre Jeff's qceff_inflation_fix commits
KDM+fix = Kevin's ~DART/Manhattan_git code with Jeff's qceff_inflation_fix commits

Where qceff_inflation_fix commits are some code that records true/false whether a transform to probit is successful. If false, the state elements/fwd ops are not part of the assimilation.

The tests run are:

A. KDM fix_bound_violations = .true.
B. KDM fix_bound_violations = .false.
C. KDM+fix fix_bound_violations = .true.
D. KDM+fix fix_bound_violations = .false.

Pass:
A run to completion
B errors out with "Smallest ensemble member less than lower bound ..."
A gives bitwise identical results to C
A give bitwise identical results to D

Fail for this test would be
C errors out with "Smallest ensemble member less than lower bound ..."
D errors out with "Smallest ensemble member less than lower bound ..."
C not bitwise with D
C(or D) not bitwise with A

From this we learn:

  • Failing a transform to probit is equivalent to clamping the bounds? (state or fwd op not updated)
  • Fix_bounds_violations is no longer needed as a namelist option?

The test is using a different version of cam model_mod to what we have in main. This is not relevant for running tests? ie. we can use main cam-fv model_mod to test with when we change qceff_inflation_fix to bring it into main.
The model_mod differences between KDM and main, which seems to be excluding obs within N Pa of the model surface is not relevant to these tests:

/glade/~raeder/DART/Manhattan_git(qceff_cam-fv)$ gitdiff --name-only \
qceff_cam-fv..origin/qceff_inflation_fix

  • models/cam-fv/model_mod.f90
  • models/cam-fv/model_mod.nml
  • models/cam-fv/shell_scripts/cesm2_1_QCEFF/DART_config.template
  • models/cam-fv/shell_scripts/cesm2_1_QCEFF/assimilate.csh.template
  • models/cam-fv/shell_scripts/cesm2_1_QCEFF/list_of_files_to_commit.csh
  • models/cam-fv/shell_scripts/cesm2_1_QCEFF/setup_advanced
  • models/cam-fv/work/input.nml

@kdraeder responded:
I haven't run C. yet, since it wouldn't have told me whether the fix_bounds_violation or the _fix
was preventing the negatives. So we can't say yet "A gives bitwise identical results to C"
I'll try that for completeness if we need it, which would also determine whether C and D
are bitwise identical.

KDM has code for ignoring obs near the surface, but it was turned off via namelist for all of these experiments.

@jlaucar jlaucar linked a pull request Jan 9, 2025 that will close this issue
15 tasks
@hkershaw-brown hkershaw-brown added the blocked waiting on something else label Jan 21, 2025
@hkershaw-brown hkershaw-brown pinned this issue Jan 21, 2025
@hkershaw-brown hkershaw-brown added the Critical Affects critical functionality or data label Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked waiting on something else Bug Something isn't working Critical Affects critical functionality or data QCEFF quantile conserving filters
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants