NaNs observed in tridiagonal eigensolver after "bulkerification" #953

RMeli · 2023-08-11T15:49:17Z

DLA-Future's tridiagonal eigensolver returns NaNs for some matrices obtained from CP2K. This appears to happen after the "bulkerification" of PRs #860 and #904.

Investigation

Observations in CP2K:

The issue persisted after heavy refactoring of CP2K/DLA-Future integration
- Input matrix in CP2K and DLA-Future match (checked a bunch of elements)
For QS/H2O-32 and QS/H2O-64 results (total energy) appear to match ScaLAPACK
- For some block sizes, some eigenvalues are not sorted in ascending order
For QS/H2O-128 nans are produced and CP2K crashes, for some block sizes
- Block size of 256 and 512 crash
- Block size of 1024 works (same total energy as ScaLAPACK)
The SCF step at which nans are produced appears to depend on the block size

DLA-Future investigation:

Output matrices in NumPy format from eigensolver
- Circumvent issues with incompatible MPICH versions between CP2K/dbcsr and HDF5
Analysis of input and output NumPy matrices
- numpy.linalg.eigh does not produce nans
- NumPy's eigenvalues are sorted in ascending order
- Some of DLA-Future eigenvalues are not in ascending order (but they match the corresponding eigenvector)
Output matrices in NumPy format for the tridiagonal eigensolver (together with @albestro)
- numpy.linalg.eigh and scipy.linalg.eigh_tridiagonal do not produce nans
- NumPy matrices converted to HDF5 matrices in a DLA-Future-compatible format
- miniapp_tridiag_solver produces nans
  - Also for the local eigensolver
  - Also with a single pika thread
Manual bisection (with cherry-picking of the C API and the HDF5 functionality)
- Started from Enforce single threaded parallel regions for blas/lapack calls #668
- Cherry picked ScaLAPACK-like interface for Cholesky decomposition and eigensolver #886 and Introduce HDF5 #880
- For the distributed trigiadonal solver nans start to appear after TridiagSolver (distributed): "bulkerify" rank1 problem solution #904
- For the local tridiagonal solver nans start to appear after TridiagSolver (local): "bulkerify" rank1 problem solution #860
Code inspection (I'm not familiar with this part of DLA-Future)
- The expression std::sqrt(-w[i]) immediately jumped to the eyes as a possible cause of nans (if w[i] > 0)
- Stepping through the code with GDB (+VScode) confirmed that occasionally (w[i] > 0)

Further investigation is ongoing, but as discussed with @albestro and @msimberg it's good to keep track of this here.

Details

CP2K

H2O-32

ScaLAPACK:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00    0.4     2.82070305      -546.8547872795 -5.47E+02
     2 P_Mix/Diag. 0.40E+00    0.5     2.00626128      -548.3716222436 -1.52E+00
     3 P_Mix/Diag. 0.40E+00    0.5     1.20123309      -549.2863094720 -9.15E-01
     4 P_Mix/Diag. 0.40E+00    0.5     0.73381920      -549.8271179546 -5.41E-01
     5 P_Mix/Diag. 0.40E+00    0.5     0.44662463      -550.1492415452 -3.22E-01
     6 P_Mix/Diag. 0.40E+00    0.5     0.27176753      -550.3418499811 -1.93E-01
     7 P_Mix/Diag. 0.40E+00    0.5     0.16528526      -550.4572302348 -1.15E-01
     8 P_Mix/Diag. 0.40E+00    0.5     0.10047105      -550.5264082764 -6.92E-02
     9 P_Mix/Diag. 0.40E+00    0.5     0.06104581      -550.5679020528 -4.15E-02
    10 DIIS/Diag.  0.11E-03    1.0     0.03768831      -550.5927951629 -2.49E-02
    11 DIIS/Diag.  0.73E-05    0.5     0.00001951      -550.6301331620 -3.73E-02
    12 DIIS/Diag.  0.17E-05    0.6     0.00000433      -550.6301331629 -9.00E-10

  *** SCF run converged in    12 steps ***

DLA-Future:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00    1.1     2.82070305      -546.8547872795 -5.47E+02
     2 P_Mix/Diag. 0.40E+00    1.2     2.00626128      -548.3716222436 -1.52E+00
     3 P_Mix/Diag. 0.40E+00    1.2     1.20123309      -549.2863094720 -9.15E-01
     4 P_Mix/Diag. 0.40E+00    1.2     0.73381920      -549.8271179546 -5.41E-01
     5 P_Mix/Diag. 0.40E+00    1.2     0.44662463      -550.1492415452 -3.22E-01
     6 P_Mix/Diag. 0.40E+00    1.2     0.27176753      -550.3418499811 -1.93E-01
     7 P_Mix/Diag. 0.40E+00    1.2     0.16528526      -550.4572302348 -1.15E-01
     8 P_Mix/Diag. 0.40E+00    1.2     0.10047105      -550.5264082764 -6.92E-02
     9 P_Mix/Diag. 0.40E+00    1.2     0.06104581      -550.5679020528 -4.15E-02
    10 DIIS/Diag.  0.11E-03    1.2     0.03768831      -550.5927951629 -2.49E-02
    11 DIIS/Diag.  0.73E-05    1.2     0.00001951      -550.6301331620 -3.73E-02
    12 DIIS/Diag.  0.17E-05    1.2     0.00000433      -550.6301331629 -8.99E-10

  *** SCF run converged in    12 steps ***

Inspecting the eigenvalues shows that some are not ordered in ascending order (cyclic shift):

>>> import numpy as np
>>> import matin
>>> import matout
# Check that there are no NaNs
>>> np.isnan(matin.a).sum()
0
>>> np.isnan(matout.evecs).sum()
0
>>> np.isnan(matout.evals).sum()
0
>>> matout.evecs[:10]
array([[ 1.49482e-01,  7.15696e-02, -4.77254e-04, ...,  7.07977e-06,
        -7.22529e-04,  4.34057e-04],
       [ 5.39234e-02,  2.46975e-02,  1.82066e-03, ..., -2.36429e-01,
        -1.80969e-01,  2.02647e-02],
       [-9.58799e-02, -4.43881e-02, -3.39398e-03, ..., -2.77497e-02,
        -2.07821e-02,  2.28366e-03],
       ...,
       [-4.52588e-02, -2.20263e-02, -9.55146e-03, ..., -4.00613e-03,
        -4.45791e-03,  1.78821e-03],
       [ 4.38312e-03,  1.98188e-03,  7.23746e-04, ..., -1.39408e-03,
        -2.72728e-03, -2.21272e-04],
       [ 2.05750e-02,  9.72287e-03,  1.94750e-03, ...,  1.30918e-04,
        -3.58272e-04,  1.87918e-03]], dtype=float32)
>>> matout.evals[:10]
array([[-0.838936],
       [-0.83703 ],
       [-0.834113],
       [-0.832972],
       [-0.816725],
       [-0.815055],
       [-0.81234 ],
       [-0.810887],
       [-0.808261],
       [-0.807243]], dtype=float32)
# Check that all elements the input evecs and evals have been changed
>>> np.isclose(matout.evals, -42.0).sum()
0
>>> np.isclose(matout.evecs, -23.0).sum()
0
>>> matout.evals[matout.evals < -10]
array([], dtype=float32)
>>> matout.evecs[matout.evecs < -10]
array([], dtype=float32)
# Use NumPy eigensolver
>>> e, ee = np.linalg.eigh(matin.a)
>>> e[:10]
array([-0.8389363 , -0.83703023, -0.83411264, -0.832972  , -0.81672525,
       -0.8150545 , -0.8123403 , -0.8108873 , -0.8082608 , -0.8072428 ],
      dtype=float32)
>>> np.allclose(matout.evals.squeeze(-1), e)
False
# Check differences
>>> diff = np.abs(matout.evals.squeeze(-1) - e)
>>> matout.evals.squeeze(-1)[diff > 1e-3]
array([4.83699, 4.86062, 4.88254, 4.81451], dtype=float32)
>>> e[diff > 1e-3]
array([4.814508 , 4.8369913, 4.860616 , 4.8825364], dtype=float32)
>>> np.where(diff > 0.001)
(array([1070, 1071, 1072, 1073]),)
>>> e[1069:1075]
array([4.8128195, 4.814508 , 4.8369913, 4.860616 , 4.8825364, 4.9199357],
      dtype=float32)
>>> matout.evals[1069:1075]
array([[4.81282],
       [4.83699],
       [4.86062],
       [4.88254],
       [4.81451],
       [4.91994]], dtype=float32)

H2O-64

ScaLAPACK:

12 DIIS/Diag.	0.76E-05	3.2	0.00000704	-1101.2367103259	-8.93E-09

DLA-Future (time missing from my notes, not relevant here):

12 DIIS/Diag.	0.76E-05		0.00000704	-1101.2367103259	-8.93E-09

The behavior is different depending on the block size.

Block size of 64:

>>> import numpy as np
>>> import matin_0
>>> import matout_0
# Check that there are no NaNs
>>> np.isnan(matin_0.a).sum()
0
>>> np.isnan(matout_0.evecs).sum()
0
>>> np.isnan(matout_0.evals).sum()
0
# Solve with NumPy
>>> e, ee = np.linalg.eigh(matin_0.a)
>>> e[:10]
array([-0.8325389 , -0.83078194, -0.8301129 , -0.828938  , -0.8274322 ,
       -0.8256345 , -0.8236141 , -0.8221858 , -0.81873846, -0.8182086 ],
      dtype=float32)
>>> matout_0.evals[:10]
array([[-0.832539],
       [-0.830782],
       [-0.830113],
       [-0.828938],
       [-0.827432],
       [-0.825635],
       [-0.823614],
       [-0.822186],
       [-0.818738],
       [-0.818209]], dtype=float32)
# Check difference with NumPy
>>> diff = np.abs(matout_0.evals.squeeze(-1) - e)
>>> matout_0.evals.squeeze(-1)[diff > 1e-3]
array([], dtype=float32)
>>> matout_0.evals.squeeze(-1)[diff > 1e-4]
array([], dtype=float32)
>>> matout_0.evals.squeeze(-1)[diff > 1e-5]
array([10.5684, 12.7143, 12.784 , 12.7919, 12.8208, 12.8527, 12.8769,
       12.9043, 12.9286, 12.9424, 12.9458, 12.981 , 13.0004, 13.0243,
       13.0349, 13.0531, 13.0781, 13.0982, 13.1076, 13.1336, 13.143 ,
       13.1551, 13.1633, 13.2074, 13.213 , 13.2453, 13.2615, 13.289 ,
       13.2937, 13.298 , 13.3325, 13.3397, 13.3659, 13.3818, 13.3879,
       13.449 , 13.483 , 13.5031, 13.5312, 13.5696, 13.5928, 13.6196,
       13.6795, 13.6988, 13.727 , 13.7304, 13.7532, 13.8624, 13.9351,
       13.9837, 14.2832, 15.2062], dtype=float32)
>>> np.where(diff > 0.001)
(array([], dtype=int64),)

Differences are small. Calculations in DLA-Future are performed in double precision, but the output is in single precision.

Block size of 128:

>>> import numpy as np
>>> import matout_0 as matout
>>> import matin_0 as matin
>>> e, ee = np.linalg.eigh(matin.a)
>>> diff = np.abs(matout.evals.squeeze(-1) - e)
>>> matout.evals.squeeze(-1)[diff > 1e-3]
array([3.12042, 3.12242, 3.14158, 3.14504, 3.11545], dtype=float32)
>>> e[diff > 1e-3]
array([3.11539  , 3.1204178, 3.1224189, 3.1415825, 3.1450453],
      dtype=float32)
>>> np.isnan(matout.a).sum()
0
>>> np.isnan(matout.evecs).sum()
5603840
>>> np.isnan(matout.evals).sum()
0
>>> np.where(diff > 0.001)
(array([1671, 1672, 1673, 1674, 1675]),)
>>> e[1670:1677]
array([3.1043088, 3.11539  , 3.1204178, 3.1224189, 3.1415825, 3.1450453,
       3.1483176], dtype=float32)
>>> matout.evals[1670:1677]
array([[3.10431],
       [3.12042],
       [3.12242],
       [3.14158],
       [3.14504],
       [3.11545],
       [3.14832]], dtype=float32)

Same cyclic shift observed for H2O-32.

H2O-128

ScaLAPACK:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   24.0     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   25.7     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   25.5     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   24.6     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   25.7     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   24.8     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   24.8     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   25.8     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   25.0     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   26.2     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   25.9     0.00001917     -2202.5836419845 -1.63E-08
    12 DIIS/Diag.  0.20E-04   24.9     0.00001266     -2202.5836419832  1.29E-09
    13 DIIS/Diag.  0.12E-05   26.0     0.00000152     -2202.5836419863 -3.09E-09

  *** SCF run converged in    13 steps ***

DLA-Future:

Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   15.8            NaN     -2188.1751193349 -2.19E+03

 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]                                                                     *
 *  \___/                KS energy is an abnormal value (NaN/Inf).             *
 *    |                                                                        *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                                     qs_ks_methods.F:889 *
 *******************************************************************************

Increasing the block size from 256 to 1024 appears to work:

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   41.5     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   49.0     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   48.5     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   43.0     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   48.3     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   48.7     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   46.0     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   62.1     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   61.0     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   45.1     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   79.0     0.00001917     -2202.5836419845 -1.63E-08
    12 DIIS/Diag.  0.20E-04   53.5     0.00001266     -2202.5836419833  1.28E-09
    13 DIIS/Diag.  0.12E-05   48.0     0.00000152     -2202.5836419863 -3.08E-09

  *** SCF run converged in    13 steps ***

DLA-Future

Tridiagonal eigensolver miniapp

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5

HDF5 "td-evals.h5" {
GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): -nan,
      (0,1,0): -0.932222,
      (0,2,0): -0.931106,
      (0,3,0): -0.929937,
      (0,4,0): -0.929327,
      (0,5,0): -0.92884,
      (0,6,0): -0.928138,
      (0,7,0): -0.928094,
      (0,8,0): -0.927308,
      (0,9,0): -0.926688,
      (0,10,0): -0.926199,
      (0,11,0): -0.925902,
      (0,12,0): -0.925131,
      (0,13,0): -0.924653,
      (0,14,0): -0.924199,
      (0,15,0): -0.923768,
      (0,16,0): -0.923444,
      (0,17,0): -0.922861,
      (0,18,0): -0.922458,
      (0,19,0): -0.922242,
      (0,20,0): -0.921639,
      (0,21,0): -0.921327,
      (0,22,0): -0.921066,
      (0,23,0): -0.920946,
      (0,24,0): -0.92021,
      (0,25,0): -0.919626,
      (0,26,0): -0.919327,
      (0,27,0): -0.91917,
      (0,28,0): -0.918547,
      (0,29,0): -0.918441,
      (0,30,0): -0.917988,
      (0,31,0): -0.917847,
      (0,32,0): -0.917361,
      (0,33,0): -0.9172,
      (0,34,0): -0.916821,
      (0,35,0): -0.916179,
      (0,36,0): -0.915846,
      (0,37,0): -0.915483,
      (0,38,0): -nan,
      (0,39,0): -nan,
      (0,40,0): -nan,
      (0,41,0): -nan,
      (0,42,0): -nan,

Experiments with block sizes

Block size of 128:

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5 --block-size 128

GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): nan,
      (0,1,0): nan,
      (0,2,0): nan,
      (0,3,0): nan,
      (0,4,0): nan,
      (0,5,0): nan,
      (0,6,0): nan,
      (0,7,0): nan,
      (0,8,0): nan,
      (0,9,0): nan,
      (0,10,0): nan,
      (0,11,0): nan,
      (0,12,0): nan,
      (0,13,0): nan,
      (0,14,0): nan,
      (0,15,0): nan,
      (0,16,0): nan,
      (0,17,0): nan,
      (0,18,0): nan,
      (0,19,0): nan,
      (0,20,0): nan,
      (0,21,0): nan,
      (0,22,0): nan,
      (0,23,0): nan,
      (0,24,0): nan,
      (0,25,0): nan,
      (0,26,0): nan,
      (0,27,0): nan,
      (0,28,0): nan,
      (0,29,0): nan,
      (0,30,0): nan,
      (0,31,0): nan,
      (0,32,0): nan,
      (0,33,0): nan,
      (0,34,0): nan,
      (0,35,0): nan,
      (0,36,0): nan,
      (0,37,0): nan,
      (0,38,0): nan,
      (0,39,0): nan,
      (0,40,0): nan,
      (0,41,0): nan,
      (0,42,0): nan,
      (0,43,0): nan,

Block size of 256:

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5 --local --block-size 256

GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): nan,
      (0,1,0): -0.932222,
      (0,2,0): -0.931106,
      (0,3,0): -0.929937,
      (0,4,0): -0.929327,
      (0,5,0): -0.92884,
      (0,6,0): -0.928138,
      (0,7,0): -0.928094,
      (0,8,0): -0.927308,
      (0,9,0): -0.926688,
      (0,10,0): -0.926199,
      (0,11,0): -0.925902,
      (0,12,0): -0.925131,
      (0,13,0): -0.924653,
      (0,14,0): -0.924199,
      (0,15,0): -0.923768,
      (0,16,0): -0.923444,
      (0,17,0): -0.922861,
      (0,18,0): -0.922458,
      (0,19,0): -0.922242,
      (0,20,0): -0.921639,
      (0,21,0): -0.921327,
      (0,22,0): -0.921066,
      (0,23,0): -0.920946,
      (0,24,0): -0.92021,
      (0,25,0): -0.919626,
      (0,26,0): -0.919327,
      (0,27,0): -0.91917,
      (0,28,0): -0.918547,
      (0,29,0): -0.918441,
      (0,30,0): -0.917988,
      (0,31,0): -0.917847,
      (0,32,0): -0.917361,
      (0,33,0): -0.9172,
      (0,34,0): -0.916821,
      (0,35,0): -0.916179,
      (0,36,0): -0.915846,
      (0,37,0): -0.915483,
      (0,38,0): nan,
      (0,39,0): nan,
      (0,40,0): nan,
      (0,41,0): nan,
      (0,42,0): nan,
      (0,43,0): nan,

Block size of 1024:

miniapp/miniapp_tridiag_solver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --type s --input-file ../trid-cp2k.h5 --block-size 1024

GROUP "/" {
   DATASET "evals" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1, 5120, 1 ) / ( 1, 5120, 1 ) }
      DATA {
      (0,0,0): -0.932222,
      (0,1,0): -0.931106,
      (0,2,0): -0.929937,
      (0,3,0): -0.929327,
      (0,4,0): -0.928841,
      (0,5,0): -0.928138,
      (0,6,0): -0.928094,
      (0,7,0): -0.927308,
      (0,8,0): -0.926688,
      (0,9,0): -0.9262,
      (0,10,0): -0.925902,
      (0,11,0): -0.925131,
      (0,12,0): -0.924653,
      (0,13,0): -0.924199,
      (0,14,0): -0.923767,
      (0,15,0): -0.923444,
      (0,16,0): -0.922861,
      (0,17,0): -0.922458,
      (0,18,0): -0.922242,
      (0,19,0): -0.921639,
      (0,20,0): -0.921327,
      (0,21,0): -0.921066,
      (0,22,0): -0.920946,
      (0,23,0): -0.92021,
      (0,24,0): -0.919625,
      (0,25,0): -0.919329,
      (0,26,0): -0.91917,
      (0,27,0): -0.918547,
      (0,28,0): -0.918441,
      (0,29,0): -0.917988,
      (0,30,0): -0.917847,
      (0,31,0): -0.917361,
      (0,32,0): -0.9172,
      (0,33,0): -0.91682,
      (0,34,0): -0.916179,
      (0,35,0): -0.915846,
      (0,36,0): -0.915483,
      (0,37,0): -0.915208,
      (0,38,0): -0.915144,
      (0,39,0): -0.914922,
      (0,40,0): -0.914535,
      (0,41,0): -0.914342,
      (0,42,0): -0.91416,
      (0,43,0): -0.91385,

Debugging

nan values stem from std::squrt(-w[i]) in solveRank1Problem
- Some of the w[i] are positive
  - Also happens with --pika:threads=1

DLA-Future/include/dlaf/eigensolver/tridiag_solver/merge.h

Line 596 in 26cf7e9

z_tiles[0].ptr()[i] = std::copysign(std::sqrt(-w[i]), z_ptr[to_sizet(i)]);

With k = 2146:

w[50] = 0.000914332923
w[51] = 4.66800714e-07
w[165] = 5.93085525e-10
w[166] = 0.00505260751

With k = 2044:

w[1790] = 1.14765066e-10

With k = 1

w[0] = 1

The text was updated successfully, but these errors were encountered:

albestro · 2023-08-18T09:14:39Z

With @RMeli we investigated a bit and here I'm going to summarise what we found out. We are not fully sure about it, but it might be relevant for fixing this issue.

sqrt(-w[i]) is the origin of NaNs (at step n), which produces NaNs because w[i] > 0 for some i
w[i] = f(q, d_ptr) and d_ptr
- q contains the eigenvectors, modified by laed4
- d_ptr is the vector with eigenvalues, which should be sorted as laed4 precondition requires

Apparently, the precondition about sorted eigenvalues is not always verified. Indeed, we added checks about d0 sorting at various step of the mergeSubproblems algorithm:

auto mergeSubProblems() {
// d0 is in "initial" state with (potentially) 4 groups
// non-deflated | deflated | non-deflated | deflated
// each pair coming from one of the two subproblems

assembleZVec(rho, e0); // -> z0 = f(e0)

// i1 is created and, from it, i2 is created for sorting all eigenvalues of both problems
CHECK_SORTED(d0, i2);          // d0-post-sort             [0,n)
BACKUP(i2);
BACKUP(d0);

applyDeflation(tol, rho, d0, z0, ...);
k = stablePartition(...); // -> i3

CHECK_BACKUP(i2);              // CHANGE i2
CHECK_BACKUP(d0);              // CHANGE d0
CHECK_SORTED(d0, i2);          // d0-post-stable-full      [0,n)
CHECK_SORTED(d0, i3);          // d0-post-stable-nd-d|k    [0,k)|[k, n)

// they are separated in non-deflated and deflated
// but each group should be sorted
sort(d0, i3); 

CHECK_SORTED(d0);              // d0-sorted|k              [0,k)|[k, n)
}

What we found out is that at some point, applyDeflation introduces values that make the eigenvalues array not fully sorted anymore.

This has been tested with the input provided by CP2K (i.e. 5120 case), with different blocksizes (128, 256).

This sorting problem appears both on pre-bulk (b44340f PR888 + patch for HDF5) and on master, but:

on master it results in NaNs
on pre-bulk it produces results, but from a quick check in CP2K, it seems that the results are not correct

I look forward to confirmation that expectations about sorting are correct as we stated in this report (@rasolca can you have a look, please?). If yes, from a naive point of view it looks like a numerical problem somewhere.

I'm not sure it will solve the NaNs that @RMeli addressed, because it might be that bulk introduces other approximations. So, for the moment I won't open a separate issue, but in the end, if we realise they are two different problems, we will open a separate issue about this sorting problem.

Details

In the output, you will find also some internal output from applyDeflation that reports just the changes done on eigenvalues, and it looks like a block like this

tol = 1.95696e-05
c = -1 s = -0.000785526
i1s = 257 i2s = 0 @ 1|2
d1 = -0.949937 d2 = -0.932222
d1 = -0.949937 d2 = -0.932222

Now, let's start with an extent of the output of the run with 256 blocksize. I took the very first part of the output and split it in parts, trying to make it easier to read.

[0]
tol = 1.95696e-05
c = -1 s = -0.000785526
i1s = 257 i2s = 0 @ 1|2
d1 = -0.949937 d2 = -0.932222
d1 = -0.949937 d2 = -0.932222
tol = 1.95696e-05
c = 0.819746 s = -0.572728
i1s = 3 i2s = 258 @ 5|6
d1 = -0.928868 d2 = -0.928827
d1 = -0.928854 d2 = -0.92884
tol = 1.95696e-05
c = -0.772187 s = -0.635396
i1s = 340 i2s = 74 @ 158|159
d1 = 1.71013 d2 = 1.71014
d1 = 1.71014 d2 = 1.71014
tol = 1.95696e-05
c = -0.00174197 s = 0.999999
i1s = 185 i2s = 186 @ 392|393
d1 = 12.7213 d2 = 12.7305
d1 = 12.7305 d2 = 12.7213
CHANGE d0 3 -0.928854 -0.928868
CHANGE d0 74 1.71014 1.71014
CHANGE d0 185 12.7305 12.7213
CHANGE d0 186 12.7213 12.7305
CHANGE d0 258 -0.92884 -0.928827
CHANGE d0 340 1.71014 1.71013
d0-post-stable-full NOT SORTED 0:512 1
393,

In the previous part it can be seen that there is an unexpected out of order value in position 393 (they should be fully sorted), and actually index 393 has been changed by applyDeflation.

tol = 1.95696e-05
c = -0.00174197 s = 0.999999
i1s = 185 i2s = 186 @ 392|393
d1 = 12.7213 d2 = 12.7305
d1 = 12.7305 d2 = 12.7213

It looks quite a big change, but I don't have deep knowledge of this algorithm. Anyway, this change does not really end up causing any issue once i3 is applied, so the precondition stands in this case.

<continue from above>
tol = 6.6807e-06
c = 1 s = 1.74006e-05
i1s = 256 i2s = 1 @ 0|2
d1 = -1.15486 d2 = -0.921493
d1 = -1.15486 d2 = -0.921493
tol = 6.6807e-06
c = 0.801806 s = -0.597584
i1s = 193 i2s = 450 @ 387|388
d1 = 6.2362 d2 = 6.23621
d1 = 6.2362 d2 = 6.2362
CHANGE d0 193 6.2362 6.2362
CHANGE d0 450 6.2362 6.23621

At this step in the output above nothing happens, there are changes, but the sorting is not altered.

<continue from above>
tol = 6.56497e-06
c = -1 s = 1.51461e-05
i1s = 0 i2s = 3 @ 0|3
d1 = -0.966395 d2 = -0.911016
d1 = -0.966395 d2 = -0.911016
tol = 6.56497e-06
c = 1 s = -1.08164e-05
i1s = 0 i2s = 5 @ 0|5
d1 = -0.966395 d2 = -0.909847
d1 = -0.966395 d2 = -0.909847
tol = 6.56497e-06
c = 1 s = -9.92931e-05
i1s = 0 i2s = 6 @ 0|6
d1 = -0.966395 d2 = -0.909007
d1 = -0.966395 d2 = -0.909007
tol = 6.56497e-06
c = 1 s = 6.15887e-05
i1s = 0 i2s = 7 @ 0|7
d1 = -0.966395 d2 = -0.908289
d1 = -0.966395 d2 = -0.908289
tol = 6.56497e-06
c = 7.6434e-06 s = 1
i1s = 17 i2s = 257 @ 18|21
d1 = -0.901174 d2 = -0.634082
d1 = -0.634082 d2 = -0.901174
CHANGE d0 17 -0.634082 -0.901174
CHANGE d0 257 -0.901174 -0.634082
d0-post-stable-full NOT SORTED 0:512 2
19, 21,
d0-post-stable-nd-d|486 NOT SORTED 486:512 1
495,
d0-sorted|486 NOT SORTED 486:512 1
495,

We can see that:

applyDeflation altered the original full-order given by i2 (index 19, 21 are out of order now)
the sorting is not ensured to the following steps of the algorithm (index 495 is out of order, but it is in the deflated part so it does not violate the precondition, at least the laed4 one)

PRE-BULK vs MASTER (with bulk)

This sorting problem exists in both versions, but apparently, just for the bulk it manifests with NaNs.

As can be seen from the diff output, at some point deflation has NaNs for post-bulk (right). Since it appears in c and s, it means that at that point z, which is populated from eigenvectors, is already filled with NaNs by the previous merge step. Indeed, the previous step about ordering reports

d0-post-stable-full NOT SORTED 0:2560 22
92, 149, 288, 495, 497, 542, 553, 657, 661, 759, 763, 803, 876, 1150, 1380, 1517, 1731, 1805, 1909, 1976, 2310, 2316,
d0-post-stable-nd-d|2044 NOT SORTED 0:2044 1
1791,
d0-post-stable-nd-d|2044 NOT SORTED 2044:2560 1
2136,
d0-sorted|2044 NOT SORTED 0:2044 1
1791,
d0-sorted|2044 NOT SORTED 2044:2560 1
2136,

and in particular

d0-sorted|2044 NOT SORTED 0:2044 1
1791,

says that d0-sorted has index 1791 out of order, which is in the non-deflated part (1791 < k=2044).

At this point I'm still not sure why pre-bulk and post-bulk behaves differently in terms of results, but for sure they address the same sorting problem (which looks like a problem from my naive knowledge of the algorithm).

albestro · 2023-08-21T17:28:18Z

As reported in #960, it was a problem existing in our tridiagonal solver since the very beginning. @rasolca confirmed what we @RMeli and I reported in the very last message, and proposed and implemented a fix.

Just for the records, a note about the bulkerification, since all initial evidences were pointing at it as "the suspect". In this specific case it ended up with NaN values, but apparently it was just a different way of manifesting the same root cause (probably due to different negligible numeric approximations, that are actually present in the two different variants of the algorithm).

albestro · 2023-08-30T12:59:02Z

See #960 (comment) for more details.

That PR might have fixed this issue, but we are still not sure. For sure, that PR fixes a problem in tridiagonal solver.

RMeli · 2023-09-04T12:52:38Z

The discrepancy from the results reported here and the ones reported in #960 are down to EPS_DEFAULT.

-      EPS_DEFAULT 1.0E-12
+      EPS_DEFAULT 1.0E-5

+  &FM
+    FORCE_BLOCK_SIZE .TRUE.
+    NCOL_BLOCKS 256
+    NROW_BLOCKS 256
+  &END

With EPS_DEFAULT=1.0E-5 I have:

DLA-Future

 SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   16.0     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   17.0     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   16.9     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   16.9     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   16.9     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   16.9     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   16.9     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   17.0     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   16.9     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   16.9     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   17.0     0.00001917     -2202.5836419845 -1.64E-08
    12 DIIS/Diag.  0.20E-04   17.0     0.00001266     -2202.5836419833  1.28E-09
    13 DIIS/Diag.  0.12E-05   17.0     0.00000152     -2202.5836419863 -3.09E-09

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9963829224        0.0036170776
  Core density on regular grids:             1023.9974460717       -0.0025539283
  Total charge density on r-space grids:        0.0010631494
  Total charge density g-space grids:           0.0010631494

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.21912796280003
  Hartree energy:                                            2289.02439878388213
  Exchange-correlation energy:                               -533.21719010794914

  Total energy:                                             -2202.58364198634308

ScaLAPACK

SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   41.5     2.31075596     -2188.1751193349 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   43.9     1.56593676     -2193.9737889422 -5.80E+00
     3 P_Mix/Diag. 0.40E+00   43.5     0.93543237     -2197.4624525642 -3.49E+00
     4 P_Mix/Diag. 0.40E+00   43.6     0.57060343     -2199.5239516010 -2.06E+00
     5 P_Mix/Diag. 0.40E+00   43.7     0.34674197     -2200.7514961845 -1.23E+00
     6 P_Mix/Diag. 0.40E+00   43.7     0.21089895     -2201.4853709921 -7.34E-01
     7 P_Mix/Diag. 0.40E+00   43.6     0.12817561     -2201.9249537927 -4.40E-01
     8 P_Mix/Diag. 0.40E+00   43.8     0.07787233     -2202.1884998956 -2.64E-01
     9 DIIS/Diag.  0.20E-03   43.7     0.04803793     -2202.3465735603 -1.58E-01
    10 DIIS/Diag.  0.17E-04   43.6     0.00005206     -2202.5836419682 -2.37E-01
    11 DIIS/Diag.  0.14E-04   43.7     0.00001917     -2202.5836419845 -1.64E-08
    12 DIIS/Diag.  0.20E-04   43.7     0.00001266     -2202.5836419833  1.27E-09
    13 DIIS/Diag.  0.12E-05   43.7     0.00000152     -2202.5836419863 -3.08E-09

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9963829224        0.0036170776
  Core density on regular grids:             1023.9974460717       -0.0025539283
  Total charge density on r-space grids:        0.0010631494
  Total charge density g-space grids:           0.0010631494

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.21912796279980
  Hartree energy:                                            2289.02439878388213
  Exchange-correlation energy:                               -533.21719010794914

  Total energy:                                             -2202.58364198634354

RMeli · 2023-09-04T12:53:08Z

Thanks @albestro and @rasolca for the help in fixing this issue!

RMeli self-assigned this Aug 11, 2023

RMeli added the Type:Bug Something isn't working label Aug 11, 2023

msimberg added this to the release v0.2.0 milestone Aug 21, 2023

albestro mentioned this issue Aug 21, 2023

TridiagSolver: fix missing sort in the deflation #960

Merged

3 tasks

rasolca modified the milestones: release v0.2.0, Eigensolver in CP2K/Sirius Aug 30, 2023

RMeli closed this as completed Sep 4, 2023

RMeli mentioned this issue Sep 7, 2023

NaNs observed in reduction to band #974

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

RMeli commented Aug 11, 2023

albestro commented Aug 18, 2023

albestro commented Aug 21, 2023

albestro commented Aug 30, 2023

RMeli commented Sep 4, 2023

RMeli commented Sep 4, 2023

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

Comments

RMeli commented Aug 11, 2023

Investigation

Details

CP2K

H2O-32

H2O-64

H2O-128

DLA-Future

Tridiagonal eigensolver miniapp

Experiments with block sizes

Debugging

albestro commented Aug 18, 2023

Details

PRE-BULK vs MASTER (with bulk)

albestro commented Aug 21, 2023

albestro commented Aug 30, 2023

RMeli commented Sep 4, 2023

DLA-Future

ScaLAPACK

RMeli commented Sep 4, 2023