PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961

AndresOrtegaGuerrero · 2023-07-19T12:44:08Z

When a calculation doesn't reach convergence within the number used in the workchain (80 steps) it does a similar restart.
Repeating the same parameters , this could be up to 4 times.
Perhaps at a second stage (restart) the workchain should just be kill to avoid repeating unsuccessful calculations.
On the other hand, should we consider if the system might need some extra steps? Should the restart try to include 50% more steps and if it doesn't success should it kill the job?

mbercx · 2023-07-21T08:43:18Z

Thanks for raising the issue @AndresOrtegaGuerrero! Note that in case the electronic convergence fails, the PwBaseWorkChain reduces the mixing_beta :

aiida-quantumespresso/src/aiida_quantumespresso/workflows/pw/base.py

Lines 559 to 576 in 5d4a7d9

    
               @process_handler(priority=410, exit_codes=[ 
        
                   PwCalculation.exit_codes.ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED, 
        
               ]) 
        
               def handle_electronic_convergence_not_reached(self, calculation): 
        
                   """Handle `ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED` error. 
        
                   Decrease the mixing beta and fully restart from the previous calculation. 
        
                   """ 
        
                   factor = self.defaults.delta_factor_mixing_beta 
        
                   mixing_beta = self.ctx.inputs.parameters.get('ELECTRONS', {}).get('mixing_beta', self.defaults.qe.mixing_beta) 
        
                   mixing_beta_new = mixing_beta * factor 
        
                   self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new 
        
                   action = f'reduced beta mixing from {mixing_beta} to {mixing_beta_new} and restarting from the last calculation' 
        
                   self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder) 
        
                   self.report_error_handled(calculation, action) 
        
                   return ProcessHandlerReport(True)

However, we have already discussed in the past that:

The current factor for reducing the mixing_beta is too high:

aiida-quantumespresso/src/aiida_quantumespresso/workflows/pw/base.py

Line 32 in 5d4a7d9

'delta_factor_mixing_beta': 0.8,

Perhaps it only makes sense to try this once.

Increasing the number of electronic steps electron_maxstep might also be sensible, but perhaps only in case we see that the SCF is converging?

AndresOrtegaGuerrero · 2023-07-21T09:11:58Z

@mbercx I like the idea that it checks the output and it takes a decision upon the scf convergence and/change. But definitively to not do 4 times a restart that will end up in failure.

t-reents · 2023-07-21T09:15:34Z

Hi together! @mbercx Just out of curiosity, have you also discussed to include mixing_ndim in the past? I included this parameter in my own workchains on top of the plugin. It will be also increased in steps of 4 (up to a given maximum) in case the mixing_beta is already quite small. I think that I remember some calculations where it seems that this actually helped. Of course, increasing mixing_ndim also increases the memory usage so this might be the reason why this is not included per default. If I remember correctly, increasing mixing_ndim and decreasing mixing_beta might increase the necessary steps to converge and for that reason I also increase electron_maxstep in case mixing_ndim is increased.

mbercx · 2023-07-21T09:23:13Z

But definitively to not do 4 times a restart that will end up in failure.

Yeah, I've also done some tests for this. It's rare that it suddenly works after e.g. 2 restarts.

Just out of curiosity, have you also discussed to include mixing_ndim in the past?

Thanks @t-reents! I haven't played around with this setting much, to be honest, as the default (8) already seemed quite high? But it may be worth testing.

Another approach which we still have to fully integrate here is the "direct minimization" method implemented in SIRIUS by @ simonpintarelli:

https://github.com/simonpintarelli/nlcglib

I've tested the robustness of this alternative approach of finding the electronic ground state quite extensively in the past, and it is very promising.

AndresOrtegaGuerrero · 2023-07-21T09:33:00Z

@cpignedoli Could you share also your experience on these restarts ?

bastonero · 2023-11-24T00:10:03Z

Hello everyone, I am also interested in this discussion as I would like to improve the handler too.
As I collected over ~3 years some data, I performed some "scientific" analysis on PwCalculation that finished correctly (exit status 0) and the one having issues with the electronic convergence (exit status 410).

In the following plot I report the counts for the slope of the convergence, computed using a linear fit (steps vs. log(estimated scf convergence)).
convergence_statistics.pdf
You can reproduce with the snippet below. It would be good to collect some more statistics, especially on the failed ones. My statistics tells that the failed run have slopes ~ -0.1 or greater, whereas -0.2 starts to get borderline, as you can see, while lower values mean convergence will be achieved.

TAKE-HOME MESSAGE: as a rule of thumb one can compute the slope, and if it is lower than -0.1/-0.2, then it is probably worth it to increase the steps, otherwise don't and try a different strategy (e.g. change mixing mode, mixing beta, or something else).

from aiida import load_profile
from aiida.orm import *
import numpy as np
import matplotlib.pyplot as plt

load_profile()

MAX_NODE_COUNT = 10000

def linear(x,*args):
    return args[0] +args[1]*x

def fit_value(kn, array, guess):
    from scipy.optimize import curve_fit
    params, cov = curve_fit(linear, kn, array, p0=[1, (array[0]-array[-1])/(kn[0]-kn[-1])])
    
    return params, cov

q = QueryBuilder()
q.append(CalcJobNode, filters={
    'attributes.exit_status': {'in':[0]},
    'attributes.process_state': 'finished',
    'attributes.process_label': 'PwCalculation',
})
print("Tot. PwCalculation: ", q.count())

slopes_ok = []

count = 0
for n in q.iterall():
    try:
        y = n[0].tools.get_scf_accuracy(0)
        x = np.array(list(range(len(y))))+1
        params, cov = fit_value(x, np.log(y), 0)
        slopes_ok.append(params[1])
    except:
        pass
    count+=1
    if count > MAX_NODE_COUNT:
        break

q = QueryBuilder()
q.append(CalcJobNode, filters={
    'attributes.exit_status': {'in':[410]},
    'attributes.process_state': 'finished',
    'attributes.process_label': 'PwCalculation',
})
print("Tot. PwCalculation: ", q.count())

slopes_fail = []

count = 0
for n in q.iterall():
    try:
        y = n[0].tools.get_scf_accuracy(0)
        x = np.array(list(range(len(y))))+1
        params, cov = fit_value(x, np.log(y), 0)
        slopes_fail.append(params[1])
    except:
        pass
    count+=1
    if count > MAX_NODE_COUNT:
        break

plt.hist(slopes_ok, bins=120, label='Converged')
plt.hist(slopes_fail, bins=120, label='Failed')
plt.xlabel('Number of SCF')
plt.ylabel('Convergence slope')
plt.legend()
plt.savefig('./convergence_statistics.pdf', dpi=300, transparent = True, pad_inches = .1, bbox_inches = 'tight')

mbercx · 2023-11-24T07:22:08Z

Nice, thanks @bastonero! Once I get to my computer I'll check for the MC3D runs, that should give us some more statistics ^^

mbercx · 2023-11-24T12:09:36Z

Statistics on the MC3D rSCF runs (first iterations (mixing_beta = 0.4), structures obtained directly from the databases, lanthanides have been avoided here):

I was also curious about the number of SCF steps required for convergence for the successful runs:

And had a closer look at the slope for those that needed more than 50 SCF steps

The vast majority of the structures that converge do so within 50 SCF steps. For those that don't, looking at the image above, they are very likely to still converge within the next 30 SCF steps if the slope is smaller than -0.1. Note that of course there may still be others that converge if we would have run with more steps.

So I would:

Limit the default number of maximum steps (electron_maxstep) to 50.
In case of SCF convergence failure, check the slope. If < -0.1, do a full restart (i.e. from wave functions/charge density; not sure if we need more steps then?).
Else reduce mixing_beta.

For (3), I would only try this maybe once or twice. We now start with 0.4, so maybe we can set delta_factor_mixing_beta to 0.5 and not let mixing_beta go below 0.1.

Script to generate images

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import typer
from aiida import load_profile, orm
from aiida_quantumespresso.calculations.pw import PwCalculation
from rich import print
from rich.progress import track
from scipy.optimize import curve_fit
import numpy as np

load_profile("prod")

app = typer.Typer()


def get_slope_scipy(kn, array):
    """Perform linear fit using scipy's curve_fit function"""
    def linear(x,*args):
        return args[0] +args[1]*x

    params, _ = curve_fit(linear, kn, array, p0=[1, (array[0]-array[-1])/(kn[0]-kn[-1])])
    
    return params[1]


def get_slope_numpy(x, y):
    """Perform linear fit using numpy's polyfit function"""
    return np.polyfit(x, y, 1)[0]


@app.command()
def slopes(pwbase_group, min_nstep: Optional[int] = None, limit: int = None, method: str = "numpy"):
    """Sort the slopes for successful and failed calculations."""

    query = orm.QueryBuilder()

    query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append(
        orm.WorkChainNode, with_group="group", tag="workchain"
    ).append(
        PwCalculation,
        tag="pw",
        edge_filters={"label": "iteration_01"},
        filters={"attributes.exit_status": {"in": [0, 410]}},
        project='*'
    )
    if min_nstep is not None:
        query.append(
            orm.Dict,
            with_incoming="pw",
            edge_filters={"label": "output_parameters"},
            filters={
                "attributes.convergence_info.scf_conv.n_scf_steps": {">": min_nstep}
            },
        )
    if limit is not None:
        query.limit(limit)

    number = query.count()
    print(f"[bold blue]Report:[/] Number of completed pw.x: {number}")

    slopes_ok = []
    slopes_fail = []
    errors = []

    for [pw_calculation] in track(
        query.iterall(), total=number, description="Checking slopes"
    ):
        try:
            y = pw_calculation.tools.get_scf_accuracy(0)
            x = np.arange(len(y))

            if method == "numpy":
                slope = get_slope_numpy(x, np.log(y))
            elif method == "scipy":
                slope = get_slope_scipy(x, np.log(y))

            if pw_calculation.exit_status == 0:
                slopes_ok.append(slope)
            else:
                slopes_fail.append(slope)

        except Exception as e:
            errors.append((pw_calculation.pk, e))

    plt.hist(slopes_ok, bins=120, label="Converged", alpha=0.6)
    plt.hist(slopes_fail, bins=120, label="Failed", alpha=0.6)

    plt.xlabel("Convergence slope")
    plt.ylabel("Number of SCF")
    plt.legend()

    if min_nstep is not None:
        filename = f"slopes_{min_nstep}.png"
    else:
        filename = "slopes.png"

    plt.savefig(
        filename, dpi=300, pad_inches=0.1, bbox_inches="tight"
    )


@app.command()
def steps(pwbase_group, max_nstep: int = 50, limit: int = None):
    """How many steps are needed to converge?"""

    query = orm.QueryBuilder()

    query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append(
        orm.WorkChainNode, with_group="group", tag="workchain"
    ).append(
        PwCalculation,
        tag="pw",
        edge_filters={"label": "iteration_01"},
        filters={"attributes.exit_status": 0},
        project="attributes.exit_status",
    ).append(
        orm.Dict,
        with_incoming="pw",
        edge_filters={"label": "output_parameters"},
        project="attributes.convergence_info.scf_conv.n_scf_steps",
    )
    if limit is not None:
        query.limit(limit)

    number = query.count()
    print(f"[bold blue]Report:[/] Number of completed pw.x: {number}")

    n_scf_list = []

    for exit_status, n_scf_steps in track(
        query.iterall(), total=number, description="Gathering steps"
    ):
        if exit_status == 0:
            n_scf_list.append(n_scf_steps)

    plt.hist(n_scf_list, bins=40)
    plt.xlabel("Number of SCF Steps")
    plt.ylabel("Frequency")

    plt.axvline(x=max_nstep, color="red")

    struct_perc = round(
        sum([1 for i in n_scf_list if i <= max_nstep]) / number * 100, 1
    )

    plt.text(
        max_nstep,
        plt.ylim()[1] + 0.05 * (plt.ylim()[1] - plt.ylim()[0]),
        f"{struct_perc}% of structures",
        color="red",
        verticalalignment="top",
        horizontalalignment="right",
    )
    plt.savefig(
        "./steps.png", dpi=300, pad_inches=0.1, bbox_inches="tight"
    )


if __name__ == "__main__":
    app()

The current error handler for the `PwBaseWorkChain` that deals with SCF convergence issues only tries to reduce the `mixing-beta` by 20% each time the calculation fails, and doesn't consider the rate of convergence. Here we improve the error handling for this case by calculating the slope on the logarithm of the "scf accuracy". Based on some statistics presented in aiidateam#961 We noted two points: 1. Most of the calculations that converge do so within 50 SCF steps. 2. For those that don't, there is a very low chance of convergence in case the scf accuracy slope is higher than -0.1. Hence, we: * Limit the default number of maximum steps (`electron_maxstep`) to 50. * In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a full restart. * Else, check the `mixing_beta`. If above 0.1, half it and restart the calculation.

bastonero · 2023-11-24T13:05:33Z

Thanks @mbercx, this is a great analysis! Indeed, -0.1 seems to set a good threshold.
One thing to consider is that one can also extrapolate the number of max steps to reahc a certain conv_thr.
From the fit, one has:
log(conv_thr) = a + b*nsteps
This means:
nsteps = [ log(conv_thr) - a] / b
In pratice, usually a = ~2 and conv_thr = 1.0e-P (P = 8~12). Hence:
nsteps = -(P+2)/b ==> P=12: nsteps ~ -14 / b ==> b=-1: nsteps ~ 14
Which seems the vast majority of you cases. From here, one can then still extrapolate from the slop the remaining steps neeeded, and set a maximum step after which we don't think it's worth it. For example max_steps ~ 300.
For b~=0.2 ==> nsteps ~ 70
So one can in still in principle take up to b~-0.05, for which nsteps ~300. But probably then it's more convenient to use an other strategy.
For example:

Change mixing_mode: local-TF, or TF (the latter not sure when to use it`
Increase mixing_ndim: default is 8, but maybe increase to 12 or 20 (more memory usage, but nowadays this won't really be a problem for actual calculations on HPCs)

I think these two are the only resort in QE, a part from changing the mixing_beta. I experienced some times that increasing e.g. cutoff and kpoints helped really a lot. But shall we authorize ourself to touch these parameters?

mbercx · 2023-11-24T13:14:26Z

Thanks @bastonero! Adding some extrapolation is not a bad idea, will do so in #987.

Re the next strategy, currently we only adapt mixing_beta. As mentioned above, I've never really played with mixing_ndim. Is one preferable over the other in your experience?

But shall we authorize ourself to touch these parameters?

To this I would say no, since the could substantially influence the results (and resource consumption ^^). Moreover, if someone is testing convergence etc one would definitely not want that.

As a final note, I've recently added a bunch of failures to the following repository:

https://github.com/mbercx/qe-issues

The plan was to give these to Baroni & co and have them make suggestions.

I'll definitely unleach our new strategy on these before merging the PR.
If you have any problematic cases to add, feel free to open a PR to the issues repo. I just selected ~100 ones for smaller systems for testing.

bastonero · 2023-11-24T13:44:11Z

mixing_ndim sometimes helped in conjunction with changing mixing_beta but also mixing_mode. Possibly one can try in sequence:

Change mixing_beta
Change mixing_mode (to local-TF)
Change again mixing_ndim and/or mixing_beta
As point 3.

This is at least what I think I would try to do. But proper testing on actual problematic cases is a more accurate solution for sure.

Awesome re the issues! I saw there a lot of magnetic cases there. I feel those are the trickiest for sure, but don't have a good solution (probably having a physical intuition for a good starting magnetization can definitely help; nevertheless, sometimes larger steps are needed before the slope decreases sensibly - but sometime it's just luck).

mbercx · 2023-11-24T14:07:54Z

But proper testing on actual problematic cases is a more accurate solution for sure.

Indeed. I'm trying to gather more lower-dimensional cases in https://github.com/mbercx/qe-issues. I think there was also a difficult structure set somewhere we can look into, but I forget where. Will try and remember. ^^

I feel those are the trickiest for sure, but don't have a good solution

So all those did succeed with the direct minimization implemented in nlcglib for SIRIUS. Another thing on the TODO list is to finally properly integrate SIRIUS with the plugin, but there are still some minor things here and there to resolve before that, e.g.:

electronic-structure/q-e-sirius#45

bastonero · 2023-11-24T15:31:16Z

Yeah, this direct minimization approach sounds sweet !

bastonero · 2023-11-24T16:05:38Z

FYI from QE user guide:

Self-consistency is slow or does not converge at all Bad input data will often result in
bad scf convergence. Please carefully check your structure first, e.g. using XCrySDen.
Assuming that your input data is sensible :

Verify if your system is metallic or is close to a metallic state, especially if you have few
k-points. If the highest occupied and lowest unoccupied state(s) keep exchanging place
during self-consistency, forget about reaching convergence. A typical sign of such behavior
is that the self-consistency error goes down, down, down, than all of a sudden up again,
and so on. Usually one can solve the problem by adding a few empty bands and a small
broadening.

Reduce mixing beta to ∼ 0.3 ÷ 0.1 or smaller. Try the mixing mode value that is more
appropriate for your problem. For slab geometries used in surface problems or for elongated cells, mixing mode=’local-TF’ should be the better choice, dampening ”charge
sloshing”. You may also try to increase mixing ndim to more than 8 (default value).
Beware: this will increase the amount of memory you need.

Specific to USPP: the presence of negative charge density regions due to either the
pseudization procedure of the augmentation part or to truncation at finite cutoff may
give convergence problems. Raising the ecutrho cutoff for charge density will usually
help.

ccigna · 2023-11-24T16:17:40Z

Hello all! Thanks a lot for your effort in this, it can be very helpful. Together with the already mentioned solution of changing the mixing_mode for anisotropic structure, and ndim, in my experience also changing the diagonalization to 'cg' can help a lot. It would be helpful in my opinion to understand what are the problems in 'davidson' algorithm in order to figure out in which cases would be worth to change to (slower) 'cg' .

The current error handler for the `PwBaseWorkChain` that deals with SCF convergence issues only tries to reduce the `mixing-beta` by 20% each time the calculation fails, and doesn't consider the rate of convergence. Here we improve the error handling for this case by calculating the slope on the logarithm of the "scf accuracy". Based on some statistics presented in aiidateam#961 We noted two points: 1. Most of the calculations that converge do so within 50 SCF steps. 2. For those that don't, there is a very low chance of convergence in case the scf accuracy slope is higher than -0.1. Hence, we: * Limit the default number of maximum steps (`electron_maxstep`) to 50. * In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a full restart. * Else, check the `mixing_beta`. If above 0.1, half it and restart the calculation.

mbercx mentioned this issue Nov 24, 2023

PhBaseWorkChain: better handler for convergence #986

Open

mbercx linked a pull request Nov 24, 2023 that will close this issue

👌 PwBaseWorkChain: Improve handling of SCF convergence issues #987

Draft

2 tasks

bastonero added type/enhancement topic/workflows priority/important labels Nov 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961

PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961

AndresOrtegaGuerrero commented Jul 19, 2023

mbercx commented Jul 21, 2023

AndresOrtegaGuerrero commented Jul 21, 2023

t-reents commented Jul 21, 2023

mbercx commented Jul 21, 2023

AndresOrtegaGuerrero commented Jul 21, 2023

bastonero commented Nov 24, 2023

mbercx commented Nov 24, 2023

mbercx commented Nov 24, 2023

bastonero commented Nov 24, 2023 •

edited

Loading

mbercx commented Nov 24, 2023

bastonero commented Nov 24, 2023

mbercx commented Nov 24, 2023

bastonero commented Nov 24, 2023

bastonero commented Nov 24, 2023

ccigna commented Nov 24, 2023

PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961

PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961

Comments

AndresOrtegaGuerrero commented Jul 19, 2023

mbercx commented Jul 21, 2023

AndresOrtegaGuerrero commented Jul 21, 2023

t-reents commented Jul 21, 2023

mbercx commented Jul 21, 2023

AndresOrtegaGuerrero commented Jul 21, 2023

bastonero commented Nov 24, 2023

mbercx commented Nov 24, 2023

mbercx commented Nov 24, 2023

bastonero commented Nov 24, 2023 • edited Loading

mbercx commented Nov 24, 2023

bastonero commented Nov 24, 2023

mbercx commented Nov 24, 2023

bastonero commented Nov 24, 2023

bastonero commented Nov 24, 2023

ccigna commented Nov 24, 2023

bastonero commented Nov 24, 2023 •

edited

Loading