TridiagSolver: fix missing sort in the deflation #960

albestro · 2023-08-21T16:07:43Z

(probable fix for #953)

The deflation process can change the order of the deflated eigenvalues, and the step that keeps them sorted was missing from our implementation. As a result, deflated eigenvalues were not always sorted correctly, which led to wrong results or NaN values (see LAPACK's dlaed2 for more details).
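As a rough illustration of the step described above (a simplified sketch, not the code in merge.h): indices are stably partitioned so that non-deflated entries come first, and the deflated tail is then kept sorted by eigenvalue. The name `partitionAndSortDeflated`, its arguments, and the ascending order are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not the DLA-Future API: returns a permutation of indices in which
// non-deflated entries keep their relative order at the front, and deflated entries
// follow, sorted by ascending eigenvalue (the ordering is an assumption of this sketch).
std::vector<std::size_t> partitionAndSortDeflated(const std::vector<double>& values,
                                                  const std::vector<bool>& deflated) {
  assert(values.size() == deflated.size());

  std::vector<std::size_t> index(values.size());
  std::iota(index.begin(), index.end(), std::size_t{0});

  // Stable partition: non-deflated first, relative order preserved in both groups.
  auto first_deflated = std::stable_partition(index.begin(), index.end(),
                                              [&](std::size_t i) { return !deflated[i]; });

  // The kind of step that was missing: keep the deflated tail sorted by eigenvalue.
  std::stable_sort(first_deflated, index.end(),
                   [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

  return index;
}
```

Without the final sort, the deflated tail would simply inherit whatever order the partition left it in, which is not guaranteed to be sorted.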

Thanks to @RMeli and @rasolca for the investigation and the support in fixing this.

In addition to the bug fix, this PR also introduces std::hypot to avoid possible numerical errors in the deflation step (the C++ standard library function was preferred over LAPACK's dlapy2, for no strong reason).
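To show why std::hypot helps, here is a minimal sketch. It assumes the norm is used to normalise the two coefficients of a Givens rotation, as dlaed2 does with dlapy2. `GivensRotation`, `makeGivens` and `naiveNorm` are hypothetical names for this example and are not taken from the PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical type for illustration: cosine and sine of a Givens rotation.
struct GivensRotation {
  double c;
  double s;
};

// Builds the rotation coefficients from two components, normalising with std::hypot,
// which computes sqrt(a*a + b*b) without undue overflow or underflow in intermediates.
GivensRotation makeGivens(double a, double b) {
  const double r = std::hypot(a, b);
  assert(r > 0.0);
  return {a / r, b / r};
}

// Naive norm, kept only for comparison: a*a can overflow to infinity even when the
// final result would be representable.
double naiveNorm(double a, double b) {
  return std::sqrt(a * a + b * b);
}
```

For example, with both components equal to 1e200, `naiveNorm` returns infinity, while `makeGivens` still yields coefficients close to 0.7071.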

Another change is the removal of unused GPU kernels related to this step (namely stablePartitionIndexOnDevice and the related ones).

TODO:

Add some note/doc about the change
Improve test for stablePartitionIndexForDeflation
Open an issue about improving the tests for the tridiag solver, or at least add the check to miniapp_tridiag_solver

albestro · 2023-08-21T16:08:19Z

cscs-ci run

albestro · 2023-08-21T16:41:32Z

cscs-ci run

codecov-commenter · 2023-08-21T17:23:15Z

Codecov Report

Merging #960 (eedf028) into master (6084329) will increase coverage by 1.47%.
Report is 1 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #960      +/-   ##
==========================================
+ Coverage   93.35%   94.83%   +1.47%     
==========================================
  Files         143      129      -14     
  Lines        8605     7795     -810     
  Branches     1103     1049      -54     
==========================================
- Hits         8033     7392     -641     
+ Misses        388      238     -150     
+ Partials      184      165      -19

| Files Changed | Coverage Δ | |
|---|---|---|
| include/dlaf/eigensolver/tridiag_solver/kernels.h | `100.00% <ø> (ø)` | |
| include/dlaf/eigensolver/tridiag_solver/merge.h | `99.81% <100.00%> (+<0.01%)` | ⬆️ |

... and 35 files with indirect coverage changes

rasolca · 2023-08-30T09:29:08Z

cscs-ci run

albestro · 2023-08-30T12:55:01Z

TL;DR

👍 Using spack, we were able to build CP2K with DLAF support from @RMeli's branch (thanks @mathieu for the valuable support!)
🕵️‍♂️ We collected information about which input files to use for testing with CP2K (thanks @rasolca)
🖥️ We tested only on PizDaint-MC
🎉 We used H2O-128 as the test configuration, and all the runs with DLAF as backend reported the same total energy at all intermediate steps (up to ~1e-13), but ...
🤔 ... the energy values we obtained in our runs differ from what @RMeli reported in the issue, so we are still not fully sure we used the right input configuration for CP2K.

Build CP2K

Use Rocco's CP2K branch
Modify the CP2K spack package to enable the DLAF backend
⚠️ intel-oneapi-mkl has a problem with dla-future, so switch to intel-mkl
🪛 Intel MKL provides FFTW, but cmake does not find it. Adding cray-fftw as a dependency clashes with MKL, because fftw-api then has two providers. We manually changed the intel-mkl spack package so that it does not provide fftw-api (by commenting out the provides directive in it)
The dla-future version used is 202308/dev + PR#946 (band2trid + fixes(tag))

Test convergence H2O-128

InputFile

ialberto@daint103:~/workspace/cp2k> git --no-pager diff H2O-128.inp
diff --git a/H2O-128.inp b/H2O-128.inp
index 53bf03706..6b2bb0761 100644
--- a/H2O-128.inp
+++ b/H2O-128.inp
@@ -8,23 +8,21 @@
       REL_CUTOFF 30
     &END MGRID
     &QS
-      EPS_DEFAULT 1.0E-12
+      # EPS_DEFAULT 1.0E-12
       WF_INTERPOLATION PS
       EXTRAPOLATION_ORDER 3
     &END QS
     &SCF
       SCF_GUESS ATOMIC
-      &OT ON
-        MINIMIZER DIIS
-      &END OT
+      &DIAGONALIZATION ON
+        ALGORITHM STANDARD
+      &END DIAGONALIZATION
     # SCF_GUESS        RESTART
     # EPS_SCF      1.0E-7
-
       &PRINT
         &RESTART OFF
         &END
       &END
-
     &END SCF
     &XC
       &XC_FUNCTIONAL Pade
@@ -434,14 +432,12 @@
 &END FORCE_EVAL
 &GLOBAL
   PROJECT H2O-128
-  RUN_TYPE MD
+  RUN_TYPE ENERGY
   PRINT_LEVEL LOW
+  # PREFERRED_DIAG_LIBRARY scalapack
+  PREFERRED_DIAG_LIBRARY dlaf
+  &FM
+    NCOL_BLOCKS 512
+    NROW_BLOCKS 512
+  &END FM
 &END GLOBAL
-&MOTION
-  &MD
-    ENSEMBLE NVE
-    STEPS 10
-    TIMESTEP 0.5
-    TEMPERATURE 300.0
-  &END MD
-&END MOTION

Scalapack vs DLAF @ PizDaint-MC

All runs converged and all steps reported the same total energy (up to ~1e-13).

Scalapack-192

OMP_NUM_THREADS=4 srun -u -o"h2o-128-scalapack.out" -n9 -c8 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-scalapack.inp

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   29.2     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   18.6     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   18.5     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   18.4     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   18.4     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   18.4     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   18.4     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   18.4     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   18.5     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   18.5     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   18.4     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   18.4     0.00001440     -2202.2172429954 -4.31E-09
    13 DIIS/Diag.  0.44E-05   18.4     0.00000294     -2202.2172429961 -7.28E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097529
  Hartree energy:                                            2289.23206962205450
  Exchange-correlation energy:                               -533.21351435405154

  Total energy:                                             -2202.21724299609741

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996097411

DLAF256 (RPN=9)

OMP_NUM_THREADS=4 srun -u -o"h2o-128-dlaf256.out" -n9 -c8 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-dlaf.inp

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   27.9     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   24.5     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   23.0     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   22.6     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   23.8     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   22.8     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   22.3     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   21.7     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   21.2     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   24.1     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   21.2     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   20.6     0.00001440     -2202.2172429954 -4.31E-09
    13 DIIS/Diag.  0.44E-05   23.8     0.00000294     -2202.2172429961 -7.28E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097552
  Hartree energy:                                            2289.23206962205450
  Exchange-correlation energy:                               -533.21351435405143

  Total energy:                                             -2202.21724299609741

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996097411

DLAF1024 (RPN=9)

OMP_NUM_THREADS=4 srun -u -o"h2o-128-dlaf1024.out" -n9 -c8 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-dlaf.inp

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   48.4     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   47.1     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   46.2     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   46.7     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   46.8     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   46.3     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   42.7     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   46.4     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   45.3     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   40.3     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   44.8     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   46.0     0.00001440     -2202.2172429954 -4.31E-09
    13 DIIS/Diag.  0.44E-05   43.3     0.00000294     -2202.2172429961 -7.27E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097552
  Hartree energy:                                            2289.23206962205450
  Exchange-correlation energy:                               -533.21351435405143

  Total energy:                                             -2202.21724299609741

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996097411

DLAF512 (RPN=2)

OMP_NUM_THREADS=18 srun -u -o"h2o-128-dlaf512-rpn2.out" -n2 -c36 cp2k.psmp /project/csstaff/ialberto/workspace/cp2k/H2O-128-dlaf.inp

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 P_Mix/Diag. 0.40E+00   12.6     1.18531859     -2188.1728046123 -2.19E+03
     2 P_Mix/Diag. 0.40E+00   15.2     0.67180775     -2193.8258043430 -5.65E+00
     3 P_Mix/Diag. 0.40E+00   14.9     0.40258064     -2197.2271742605 -3.40E+00
     4 P_Mix/Diag. 0.40E+00   14.9     0.24022917     -2199.2362551350 -2.01E+00
     5 P_Mix/Diag. 0.40E+00   15.0     0.14394355     -2200.4323299412 -1.20E+00
     6 P_Mix/Diag. 0.40E+00   15.0     0.08615989     -2201.1473157800 -7.15E-01
     7 DIIS/Diag.  0.64E-03   15.2     0.05174404     -2201.5755625833 -4.28E-01
     8 DIIS/Diag.  0.12E-03   15.1     0.00040741     -2202.2172419816 -6.42E-01
     9 DIIS/Diag.  0.19E-03   15.1     0.00024484     -2202.2172427207 -7.39E-07
    10 DIIS/Diag.  0.14E-03   15.1     0.00010363     -2202.2172428734 -1.53E-07
    11 DIIS/Diag.  0.26E-04   15.1     0.00002311     -2202.2172429911 -1.18E-07
    12 DIIS/Diag.  0.16E-04   15.1     0.00001440     -2202.2172429954 -4.30E-09
    13 DIIS/Diag.  0.44E-05   15.1     0.00000294     -2202.2172429961 -7.29E-10

  *** SCF run converged in    13 steps ***


  Electronic density on regular grids:      -1023.9999953423        0.0000046577
  Core density on regular grids:             1023.9999999611       -0.0000000389
  Total charge density on r-space grids:        0.0000046188
  Total charge density g-space grids:           0.0000046188

  Overlap energy of the core charge distribution:               0.00001125202266
  Self energy of the core charge distribution:              -5610.60998987709900
  Core Hamiltonian energy:                                   1652.37418036097779
  Hartree energy:                                            2289.23206962205495
  Exchange-correlation energy:                               -533.21351435405131

  Total energy:                                             -2202.21724299609468

 ENERGY| Total FORCE_EVAL ( QS ) energy [a.u.]:            -2202.217242996094683
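The agreement between the four runs can be checked by comparing the final energies with a relative tolerance. The sketch below does so with the values copied from the logs above. Reading "up to ~1e-13" as a relative tolerance is an assumption, and `relativeDifference` and the constant names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Relative difference between two values, scaled by the larger magnitude.
double relativeDifference(double x, double y) {
  const double scale = std::max(std::abs(x), std::abs(y));
  assert(scale > 0.0);
  return std::abs(x - y) / scale;
}

// Final "Total FORCE_EVAL ( QS ) energy [a.u.]" values copied from the logs above.
constexpr double energy_scalapack = -2202.217242996097411;
constexpr double energy_dlaf256 = -2202.217242996097411;
constexpr double energy_dlaf1024 = -2202.217242996097411;
constexpr double energy_dlaf512_rpn2 = -2202.217242996094683;

// Returns true when a run agrees with the ScaLAPACK reference within the tolerance.
bool agreesWithReference(double energy, double tolerance = 1e-13) {
  return relativeDifference(energy_scalapack, energy) < tolerance;
}
```

Under this reading, the three runs on 9 ranks print identical final energies, and the DLAF run on 2 ranks differs from them only from the 12th decimal place onward.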

albestro added this to the release v0.2.0 milestone Aug 21, 2023

albestro requested review from rasolca and RMeli August 21, 2023 16:07

albestro self-assigned this Aug 21, 2023

albestro added the Type:Bug and Priority:High labels Aug 21, 2023

albestro force-pushed the fix-tridiag-solver branch from dc7d176 to f79d0c5 Compare August 21, 2023 16:14

albestro changed the title ~~TridiagSolver: fix missing sort in the deflation bug~~ TridiagSolver: fix missing sort in the deflation Aug 21, 2023

albestro mentioned this pull request Aug 21, 2023

NaNs observed in tridiagonal eigensolver after "bulkerification" #953

Closed

albestro added a commit that referenced this pull request Aug 22, 2023

Develop: TridiagSolver: fix missing sort in the deflation (#960)

205a6fe

albestro added 6 commits August 24, 2023 17:04

bug fix: missing sort in the deflation

21ab257

refactor sorting

4e39ce7

use hypot avoiding possible numerical problems (as lapack does)

eb5a2c2

WIP: fix build problem in test (it should be improved)

5f15c59

remove unused GPU kernel for stablePartitionForDeflation

04b6420

it was probably unused since #819

add some doc

edeadf4

albestro force-pushed the fix-tridiag-solver branch 3 times, most recently from c781736 to 1012688 Compare August 25, 2023 06:52

small extension for the test

fcff6bf

albestro force-pushed the fix-tridiag-solver branch from 1012688 to fcff6bf Compare August 25, 2023 07:01

albestro marked this pull request as ready for review August 28, 2023 12:22

rasolca approved these changes Aug 29, 2023

albestro and others added 2 commits August 29, 2023 10:33

Update include/dlaf/eigensolver/tridiag_solver/merge.h

68135c1

Co-authored-by: Raffaele Solcà <[email protected]>

Update include/dlaf/eigensolver/tridiag_solver/merge.h

eedf028

Co-authored-by: Raffaele Solcà <[email protected]>

rasolca merged commit 7f96b89 into master Aug 30, 2023
3 checks passed

rasolca deleted the fix-tridiag-solver branch August 30, 2023 13:25

github-actions bot pushed a commit that referenced this pull request Aug 30, 2023

Doc TridiagSolver: fix missing sort in the deflation (#960)

04d82f3

albestro mentioned this pull request Aug 30, 2023

Extend and/or improve test for eigensolvers (evp, gevp, tridsolver) #963

Open
