
PwBaseCalculations: number of electron_maxstep is fixed and restarts repeat the same simulation #961

Open
AndresOrtegaGuerrero opened this issue Jul 19, 2023 · 15 comments · May be fixed by #987

@AndresOrtegaGuerrero
Contributor

When a calculation doesn't reach electronic convergence within the number of steps set in the workchain (80 steps), it does a similar restart, repeating the same parameters; this can happen up to 4 times.
Perhaps at the second restart the workchain should just be killed, to avoid repeating unsuccessful calculations.
On the other hand, should we consider that the system might simply need some extra steps? Should the restart try to include 50% more steps and, if that doesn't succeed, kill the job?

@mbercx
Member

mbercx commented Jul 21, 2023

Thanks for raising the issue @AndresOrtegaGuerrero! Note that in case the electronic convergence fails, the PwBaseWorkChain reduces the mixing_beta:

@process_handler(priority=410, exit_codes=[
    PwCalculation.exit_codes.ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED,
])
def handle_electronic_convergence_not_reached(self, calculation):
    """Handle `ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED` error.

    Decrease the mixing beta and fully restart from the previous calculation.
    """
    factor = self.defaults.delta_factor_mixing_beta
    mixing_beta = self.ctx.inputs.parameters.get('ELECTRONS', {}).get('mixing_beta', self.defaults.qe.mixing_beta)
    mixing_beta_new = mixing_beta * factor
    self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new
    action = f'reduced beta mixing from {mixing_beta} to {mixing_beta_new} and restarting from the last calculation'
    self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
    self.report_error_handled(calculation, action)
    return ProcessHandlerReport(True)

However, we have already discussed in the past that:

  1. The current factor for reducing the mixing_beta is too high.

  2. Perhaps it only makes sense to try this once.

Increasing the number of electronic steps electron_maxstep might also be sensible, but perhaps only in case we see that the SCF is converging?

@AndresOrtegaGuerrero
Contributor Author

@mbercx I like the idea that it checks the output and takes a decision based on the SCF convergence and/or its change. But definitely it should not do 4 restarts that will end up in failure.

@t-reents

Hi everyone! @mbercx Just out of curiosity, have you also discussed including mixing_ndim in the past? I included this parameter in my own workchains on top of the plugin. It is also increased in steps of 4 (up to a given maximum) in case the mixing_beta is already quite small. I remember some calculations where this actually seemed to help. Of course, increasing mixing_ndim also increases the memory usage, so this might be the reason why it is not included by default. If I remember correctly, increasing mixing_ndim and decreasing mixing_beta might increase the number of steps necessary to converge, and for that reason I also increase electron_maxstep in case mixing_ndim is increased.

@mbercx
Member

mbercx commented Jul 21, 2023

But definitely it should not do 4 restarts that will end up in failure.

Yeah, I've also done some tests for this. It's rare that it suddenly works after e.g. 2 restarts.

Just out of curiosity, have you also discussed to include mixing_ndim in the past?

Thanks @t-reents! I haven't played around with this setting much, to be honest, as the default (8) already seemed quite high? But it may be worth testing.

Another approach which we still have to fully integrate here is the "direct minimization" method implemented in SIRIUS by @simonpintarelli:

https://github.com/simonpintarelli/nlcglib

I've tested the robustness of this alternative approach of finding the electronic ground state quite extensively in the past, and it is very promising.

@AndresOrtegaGuerrero
Contributor Author

@cpignedoli Could you also share your experience on these restarts?

@bastonero
Collaborator

Hello everyone, I am also interested in this discussion as I would like to improve the handler too.
Since I have collected some data over ~3 years, I performed some "scientific" analysis on the PwCalculations that finished correctly (exit status 0) and the ones having issues with the electronic convergence (exit status 410).

In the following plot I report the counts for the slope of the convergence, computed using a linear fit (steps vs. log(estimated scf accuracy)).
convergence_statistics.pdf
You can reproduce it with the snippet below. It would be good to collect some more statistics, especially on the failed ones. My statistics tell me that the failed runs have slopes of ~ -0.1 or greater, whereas -0.2 starts to get borderline, as you can see, while lower values mean convergence will be achieved.

TAKE-HOME MESSAGE: as a rule of thumb, one can compute the slope; if it is lower than -0.1/-0.2, it is probably worth increasing the number of steps, otherwise don't, and try a different strategy (e.g. change mixing mode, mixing beta, or something else).

from aiida import load_profile
from aiida.orm import CalcJobNode, QueryBuilder
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

load_profile()

MAX_NODE_COUNT = 10000


def linear(x, *args):
    return args[0] + args[1] * x


def fit_value(kn, array):
    """Linear fit of `array` vs. `kn`, returning the fit parameters and covariance."""
    params, cov = curve_fit(linear, kn, array, p0=[1, (array[0] - array[-1]) / (kn[0] - kn[-1])])
    return params, cov


def collect_slopes(exit_statuses):
    """Collect the log-accuracy slopes of PwCalculations with the given exit statuses."""
    q = QueryBuilder()
    q.append(CalcJobNode, filters={
        'attributes.exit_status': {'in': exit_statuses},
        'attributes.process_state': 'finished',
        'attributes.process_label': 'PwCalculation',
    })
    print("Tot. PwCalculation: ", q.count())

    slopes = []
    for count, [node] in enumerate(q.iterall()):
        try:
            y = node.tools.get_scf_accuracy(0)
            x = np.arange(1, len(y) + 1)
            params, _ = fit_value(x, np.log(y))
            slopes.append(params[1])
        except Exception:
            pass
        if count > MAX_NODE_COUNT:
            break
    return slopes


slopes_ok = collect_slopes([0])
slopes_fail = collect_slopes([410])

plt.hist(slopes_ok, bins=120, label='Converged')
plt.hist(slopes_fail, bins=120, label='Failed')
plt.xlabel('Convergence slope')
plt.ylabel('Counts')
plt.legend()
plt.savefig('./convergence_statistics.pdf', dpi=300, transparent=True, pad_inches=.1, bbox_inches='tight')

@mbercx
Member

mbercx commented Nov 24, 2023

Nice, thanks @bastonero! Once I get to my computer I'll check for the MC3D runs, that should give us some more statistics ^^

@mbercx
Member

mbercx commented Nov 24, 2023

Statistics on the MC3D rSCF runs (first iterations (mixing_beta = 0.4), structures obtained directly from the databases, lanthanides have been avoided here):

image

I was also curious about the number of SCF steps required for convergence for the successful runs:

image

And had a closer look at the slope for those that needed more than 50 SCF steps

image

The vast majority of the structures that converge do so within 50 SCF steps. For those that don't, looking at the image above, they are very likely to still converge within the next 30 SCF steps if the slope is smaller than -0.1. Note that of course there may still be others that converge if we would have run with more steps.

So I would:

  1. Limit the default number of maximum steps (electron_maxstep) to 50.
  2. In case of SCF convergence failure, check the slope. If < -0.1, do a full restart (i.e. from wave functions/charge density; not sure if we need more steps then?).
  3. Else reduce mixing_beta.

For (3), I would only try this maybe once or twice. We now start with 0.4, so maybe we can set delta_factor_mixing_beta to 0.5 and not let mixing_beta go below 0.1.
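To make the proposal concrete, here is a minimal sketch of that decision logic in plain Python. The function name, return convention, and exact thresholds are illustrative, not the actual PwBaseWorkChain handler:

```python
import numpy as np

SLOPE_THRESHOLD = -0.1   # below this, the SCF is still making progress
MIXING_BETA_MIN = 0.1    # don't let mixing_beta drop below this
DELTA_FACTOR = 0.5       # halve mixing_beta on each retry


def decide_restart(scf_accuracy, mixing_beta):
    """Decide the next action from the log-slope of the SCF accuracy.

    `scf_accuracy` is the per-step estimated SCF accuracy (a decreasing
    series when the run is converging). Returns (action, new_mixing_beta).
    """
    steps = np.arange(1, len(scf_accuracy) + 1)
    slope = np.polyfit(steps, np.log(scf_accuracy), 1)[0]

    if slope < SLOPE_THRESHOLD:
        # Still converging: restart fully from wave functions / charge density.
        return 'full_restart', mixing_beta
    if mixing_beta * DELTA_FACTOR >= MIXING_BETA_MIN:
        # Stalled: reduce the mixing and try again.
        return 'reduce_mixing', mixing_beta * DELTA_FACTOR
    # Out of options within this strategy.
    return 'abort', mixing_beta
```

For a run whose accuracy halves each step the slope is ln(0.5) ≈ -0.69, so the sketch chooses a full restart; for a stalled run it halves mixing_beta (0.4 → 0.2 → abort, since 0.1 is the floor).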

Script to generate images
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import typer
from aiida import load_profile, orm
from aiida_quantumespresso.calculations.pw import PwCalculation
from rich import print
from rich.progress import track
from scipy.optimize import curve_fit

load_profile("prod")

app = typer.Typer()


def get_slope_scipy(kn, array):
    """Perform linear fit using scipy's curve_fit function."""
    def linear(x, *args):
        return args[0] + args[1] * x

    params, _ = curve_fit(linear, kn, array, p0=[1, (array[0] - array[-1]) / (kn[0] - kn[-1])])
    return params[1]


def get_slope_numpy(x, y):
    """Perform linear fit using numpy's polyfit function"""
    return np.polyfit(x, y, 1)[0]


@app.command()
def slopes(pwbase_group: str, min_nstep: Optional[int] = None, limit: Optional[int] = None, method: str = "numpy"):
    """Sort the slopes for successful and failed calculations."""

    query = orm.QueryBuilder()

    query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append(
        orm.WorkChainNode, with_group="group", tag="workchain"
    ).append(
        PwCalculation,
        tag="pw",
        edge_filters={"label": "iteration_01"},
        filters={"attributes.exit_status": {"in": [0, 410]}},
        project='*'
    )
    if min_nstep is not None:
        query.append(
            orm.Dict,
            with_incoming="pw",
            edge_filters={"label": "output_parameters"},
            filters={
                "attributes.convergence_info.scf_conv.n_scf_steps": {">": min_nstep}
            },
        )
    if limit is not None:
        query.limit(limit)

    number = query.count()
    print(f"[bold blue]Report:[/] Number of completed pw.x: {number}")

    slopes_ok = []
    slopes_fail = []
    errors = []

    for [pw_calculation] in track(
        query.iterall(), total=number, description="Checking slopes"
    ):
        try:
            y = pw_calculation.tools.get_scf_accuracy(0)
            x = np.arange(len(y))

            if method == "numpy":
                slope = get_slope_numpy(x, np.log(y))
            elif method == "scipy":
                slope = get_slope_scipy(x, np.log(y))
            else:
                raise ValueError(f"Unknown method: {method}")

            if pw_calculation.exit_status == 0:
                slopes_ok.append(slope)
            else:
                slopes_fail.append(slope)

        except Exception as e:
            errors.append((pw_calculation.pk, e))

    plt.hist(slopes_ok, bins=120, label="Converged", alpha=0.6)
    plt.hist(slopes_fail, bins=120, label="Failed", alpha=0.6)

    plt.xlabel("Convergence slope")
    plt.ylabel("Counts")
    plt.legend()

    if min_nstep is not None:
        filename = f"slopes_{min_nstep}.png"
    else:
        filename = "slopes.png"

    plt.savefig(
        filename, dpi=300, pad_inches=0.1, bbox_inches="tight"
    )


@app.command()
def steps(pwbase_group: str, max_nstep: int = 50, limit: Optional[int] = None):
    """How many steps are needed to converge?"""

    query = orm.QueryBuilder()

    query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append(
        orm.WorkChainNode, with_group="group", tag="workchain"
    ).append(
        PwCalculation,
        tag="pw",
        edge_filters={"label": "iteration_01"},
        filters={"attributes.exit_status": 0},
        project="attributes.exit_status",
    ).append(
        orm.Dict,
        with_incoming="pw",
        edge_filters={"label": "output_parameters"},
        project="attributes.convergence_info.scf_conv.n_scf_steps",
    )
    if limit is not None:
        query.limit(limit)

    number = query.count()
    print(f"[bold blue]Report:[/] Number of completed pw.x: {number}")

    n_scf_list = []

    for exit_status, n_scf_steps in track(
        query.iterall(), total=number, description="Gathering steps"
    ):
        if exit_status == 0:
            n_scf_list.append(n_scf_steps)

    plt.hist(n_scf_list, bins=40)
    plt.xlabel("Number of SCF Steps")
    plt.ylabel("Frequency")

    plt.axvline(x=max_nstep, color="red")

    struct_perc = round(
        sum([1 for i in n_scf_list if i <= max_nstep]) / number * 100, 1
    )

    plt.text(
        max_nstep,
        plt.ylim()[1] + 0.05 * (plt.ylim()[1] - plt.ylim()[0]),
        f"{struct_perc}% of structures",
        color="red",
        verticalalignment="top",
        horizontalalignment="right",
    )
    plt.savefig(
        "./steps.png", dpi=300, pad_inches=0.1, bbox_inches="tight"
    )


if __name__ == "__main__":
    app()

mbercx added a commit to mbercx/aiida-quantumespresso that referenced this issue Nov 24, 2023
The current error handler for the `PwBaseWorkChain` that deals with SCF convergence
issues only tries to reduce the `mixing_beta` by 20% each time the calculation fails,
and doesn't consider the rate of convergence.

Here we improve the error handling for this case by calculating the slope on the
logarithm of the "scf accuracy". Based on some statistics presented in

aiidateam#961

We noted two points:

1. Most of the calculations that converge do so within 50 SCF steps.
2. For those that don't, there is a very low chance of convergence in case the scf
   accuracy slope is higher than -0.1.

Hence, we:

* Limit the default number of maximum steps (`electron_maxstep`) to 50.
* In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a
  full restart.
* Else, check the `mixing_beta`. If above 0.1, half it and restart the calculation.
@mbercx mbercx linked a pull request Nov 24, 2023 that will close this issue
@bastonero
Collaborator

bastonero commented Nov 24, 2023

Thanks @mbercx, this is a great analysis! Indeed, -0.1 seems to set a good threshold.
One thing to consider is that one can also extrapolate the maximum number of steps needed to reach a certain conv_thr.
From the fit, one has:
log(conv_thr) = a + b * nsteps
This means:
nsteps = [log(conv_thr) - a] / b
In practice, usually a ~ 2 and conv_thr = 1.0e-P (P = 8~12). Hence:
nsteps = -(P + 2) / b ==> P = 12: nsteps ~ -14 / b ==> b = -1: nsteps ~ 14
which covers the vast majority of your cases. From here, one can still extrapolate from the slope the remaining steps needed, and set a maximum number of steps after which we don't think it's worth continuing, for example max_steps ~ 300.
For b ~ -0.2 ==> nsteps ~ 70
So one can still in principle take up to b ~ -0.05, for which nsteps ~ 300. But probably then it's more convenient to use another strategy.
For example:

  • Change mixing_mode: local-TF, or TF (for the latter I am not sure when to use it)
  • Increase mixing_ndim: the default is 8, but maybe increase to 12 or 20 (more memory usage, but nowadays this won't really be a problem for actual calculations on HPCs)

I think these two are the only resort in QE, apart from changing the mixing_beta. I sometimes experienced that increasing e.g. the cutoff and k-points helped a lot. But shall we allow ourselves to touch these parameters?
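As a rough sketch of this extrapolation (using a natural-log fit as in the earlier snippet; the function name and the 300-step cap are illustrative assumptions, not part of the plugin):

```python
import numpy as np

MAX_TOTAL_STEPS = 300  # beyond this, switch strategy instead of adding steps


def extrapolate_steps(scf_accuracy, conv_thr):
    """Estimate the total number of SCF steps needed to reach `conv_thr`.

    Fits log(accuracy) = a + b * step, then solves for the step at which the
    accuracy crosses `conv_thr`. Returns None when the fit is not decreasing.
    """
    steps = np.arange(1, len(scf_accuracy) + 1)
    b, a = np.polyfit(steps, np.log(scf_accuracy), 1)  # slope, intercept
    if b >= 0:
        return None  # not converging at all
    n_needed = (np.log(conv_thr) - a) / b
    return int(np.ceil(n_needed))
```

A handler could then compare the estimate against MAX_TOTAL_STEPS to decide whether raising electron_maxstep is worth it or whether to change the mixing strategy instead.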

@mbercx
Member

mbercx commented Nov 24, 2023

Thanks @bastonero! Adding some extrapolation is not a bad idea, will do so in #987.

Re the next strategy, currently we only adapt mixing_beta. As mentioned above, I've never really played with mixing_ndim. Is one preferable over the other in your experience?

But shall we authorize ourself to touch these parameters?

To this I would say no, since they could substantially influence the results (and resource consumption ^^). Moreover, if someone is testing convergence etc., one would definitely not want that.

As a final note, I've recently added a bunch of failures to the following repository:

https://github.com/mbercx/qe-issues

The plan was to give these to Baroni & co and have them make suggestions.

  1. I'll definitely unleash our new strategy on these before merging the PR.
  2. If you have any problematic cases to add, feel free to open a PR to the issues repo. I just selected ~100 smaller systems for testing.

@bastonero
Collaborator

mixing_ndim sometimes helped in conjunction with changing mixing_beta but also mixing_mode. Possibly one can try in sequence:

  1. Change mixing_beta
  2. Change mixing_mode (to local-TF)
  3. Change again mixing_ndim and/or mixing_beta
  4. As point 3.

This is at least what I think I would try to do. But proper testing on actual problematic cases is a more accurate solution for sure.
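A minimal sketch of such a retry ladder (the specific parameter values below are illustrative guesses, not tested defaults of the plugin):

```python
# Hypothetical escalation ladder for SCF convergence failures, following the
# sequence suggested above. Each entry holds updates for the ELECTRONS namelist.
ESCALATION = [
    {'mixing_beta': 0.2},                          # 1. reduce the mixing
    {'mixing_mode': 'local-TF'},                   # 2. switch mixing mode
    {'mixing_ndim': 12, 'mixing_beta': 0.1},       # 3. larger mixing history
    {'mixing_ndim': 20, 'electron_maxstep': 120},  # 4. as 3, with more steps
]


def next_parameters(parameters, attempt):
    """Return updated ELECTRONS parameters for retry number `attempt`.

    Returns None once the ladder is exhausted, signalling the workchain
    should give up rather than repeat a failing calculation.
    """
    if attempt >= len(ESCALATION):
        return None
    updated = dict(parameters)  # don't mutate the caller's dict
    updated.update(ESCALATION[attempt])
    return updated
```

The point of the ladder is that each retry changes something, so the workchain never repeats an identical failing calculation.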

Awesome re the issues! I saw a lot of magnetic cases there. I feel those are the trickiest for sure, but I don't have a good solution (probably having a physical intuition for a good starting magnetization can definitely help; nevertheless, sometimes more steps are needed before the slope decreases sensibly - but sometimes it's just luck).

@mbercx
Member

mbercx commented Nov 24, 2023

But proper testing on actual problematic cases is a more accurate solution for sure.

Indeed. I'm trying to gather more lower-dimensional cases in https://github.com/mbercx/qe-issues. I think there was also a difficult structure set somewhere we can look into, but I forget where. Will try and remember. ^^

I feel those are the trickiest for sure, but don't have a good solution

So all those did succeed with the direct minimization implemented in nlcglib for SIRIUS. Another thing on the TODO list is to finally properly integrate SIRIUS with the plugin, but there are still some minor things here and there to resolve before that, e.g.:

electronic-structure/q-e-sirius#45

@bastonero
Collaborator

Yeah, this direct minimization approach sounds sweet !

@bastonero
Collaborator

FYI from QE user guide:

Self-consistency is slow or does not converge at all Bad input data will often result in
bad scf convergence. Please carefully check your structure first, e.g. using XCrySDen.
Assuming that your input data is sensible:

  1. Verify if your system is metallic or is close to a metallic state, especially if you have few
    k-points. If the highest occupied and lowest unoccupied state(s) keep exchanging place
    during self-consistency, forget about reaching convergence. A typical sign of such behavior
    is that the self-consistency error goes down, down, down, then all of a sudden up again,
    and so on. Usually one can solve the problem by adding a few empty bands and a small
    broadening.
  2. Reduce mixing beta to ∼ 0.3 ÷ 0.1 or smaller. Try the mixing mode value that is more
    appropriate for your problem. For slab geometries used in surface problems or for elongated cells, mixing mode=’local-TF’ should be the better choice, dampening ”charge
    sloshing”. You may also try to increase mixing ndim to more than 8 (default value).
    Beware: this will increase the amount of memory you need.
  3. Specific to USPP: the presence of negative charge density regions due to either the
    pseudization procedure of the augmentation part or to truncation at finite cutoff may
    give convergence problems. Raising the ecutrho cutoff for charge density will usually
    help.

@ccigna

ccigna commented Nov 24, 2023

Hello all! Thanks a lot for your effort on this, it can be very helpful. Together with the already mentioned solutions of changing the mixing_mode for anisotropic structures, and mixing_ndim, in my experience changing the diagonalization to 'cg' can also help a lot. In my opinion it would be helpful to understand what the problems in the 'davidson' algorithm are, in order to figure out in which cases it would be worth changing to the (slower) 'cg'.

mbercx added a commit to mbercx/aiida-quantumespresso that referenced this issue Nov 29, 2023