-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961
Comments
Thanks for raising the issue @AndresOrtegaGuerrero! Note that in case the electronic convergence fails, the aiida-quantumespresso/src/aiida_quantumespresso/workflows/pw/base.py Lines 559 to 576 in 5d4a7d9
However, we have already discussed in the past that:
Increasing the number of electronic steps |
@mbercx I like the idea that it checks the output and it takes a decision upon the scf convergence and/change. But definitively to not do 4 times a restart that will end up in failure. |
Hi together! @mbercx Just out of curiosity, have you also discussed to include |
Yeah, I've also done some tests for this. It's rare that it suddenly works after e.g. 2 restarts.
Thanks @t-reents! I haven't played around with this setting much, to be honest, as the default (8) already seemed quite high? But it may be worth testing. Another approach which we still have to fully integrate here is the "direct minimization" method implemented in SIRIUS by @ simonpintarelli: https://github.com/simonpintarelli/nlcglib I've tested the robustness of this alternative approach of finding the electronic ground state quite extensively in the past, and it is very promising. |
@cpignedoli Could you share also your experience on these restarts ? |
Hello everyone, I am also interested in this discussion as I would like to improve the handler too. In the following plot I report the counts for the slope of the convergence, computed using a linear fit (steps vs. log(estimated scf convergence)). TAKE-HOME MESSAGE: as a rule of thumb one can compute the slope, and if it is lower than -0.1/-0.2, then it is probably worth it to increase the steps, otherwise don't and try a different strategy (e.g. change mixing mode, mixing beta, or something else). from aiida import load_profile
from aiida.orm import *
import numpy as np
import matplotlib.pyplot as plt
load_profile()
MAX_NODE_COUNT = 10000
def linear(x,*args):
return args[0] +args[1]*x
def fit_value(kn, array, guess):
from scipy.optimize import curve_fit
params, cov = curve_fit(linear, kn, array, p0=[1, (array[0]-array[-1])/(kn[0]-kn[-1])])
return params, cov
q = QueryBuilder()
q.append(CalcJobNode, filters={
'attributes.exit_status': {'in':[0]},
'attributes.process_state': 'finished',
'attributes.process_label': 'PwCalculation',
})
print("Tot. PwCalculation: ", q.count())
slopes_ok = []
count = 0
for n in q.iterall():
try:
y = n[0].tools.get_scf_accuracy(0)
x = np.array(list(range(len(y))))+1
params, cov = fit_value(x, np.log(y), 0)
slopes_ok.append(params[1])
except:
pass
count+=1
if count > MAX_NODE_COUNT:
break
q = QueryBuilder()
q.append(CalcJobNode, filters={
'attributes.exit_status': {'in':[410]},
'attributes.process_state': 'finished',
'attributes.process_label': 'PwCalculation',
})
print("Tot. PwCalculation: ", q.count())
slopes_fail = []
count = 0
for n in q.iterall():
try:
y = n[0].tools.get_scf_accuracy(0)
x = np.array(list(range(len(y))))+1
params, cov = fit_value(x, np.log(y), 0)
slopes_fail.append(params[1])
except:
pass
count+=1
if count > MAX_NODE_COUNT:
break
plt.hist(slopes_ok, bins=120, label='Converged')
plt.hist(slopes_fail, bins=120, label='Failed')
plt.xlabel('Number of SCF')
plt.ylabel('Convergence slope')
plt.legend()
plt.savefig('./convergence_statistics.pdf', dpi=300, transparent = True, pad_inches = .1, bbox_inches = 'tight') |
Nice, thanks @bastonero! Once I get to my computer I'll check for the MC3D runs, that should give us some more statistics ^^ |
Statistics on the MC3D rSCF runs (first iterations ( I was also curious about the number of SCF steps required for convergence for the successful runs: And had a closer look at the slope for those that needed more than 50 SCF steps The vast majority of the structures that converge do so within 50 SCF steps. For those that don't, looking at the image above, they are very likely to still converge within the next 30 SCF steps if the slope is smaller than -0.1. Note that of course there may still be others that converge if we would have run with more steps. So I would:
For (3), I would only try this maybe once or twice. We now start with 0.4, so maybe we can set Script to generate images#!/usr/bin/env python
# -*- coding: utf-8 -*-
from typing import Optional
import matplotlib.pyplot as plt
import numpy as np
import typer
from aiida import load_profile, orm
from aiida_quantumespresso.calculations.pw import PwCalculation
from rich import print
from rich.progress import track
from scipy.optimize import curve_fit
import numpy as np
load_profile("prod")
app = typer.Typer()
def get_slope_scipy(kn, array):
"""Perform linear fit using scipy's curve_fit function"""
def linear(x,*args):
return args[0] +args[1]*x
params, _ = curve_fit(linear, kn, array, p0=[1, (array[0]-array[-1])/(kn[0]-kn[-1])])
return params[1]
def get_slope_numpy(x, y):
"""Perform linear fit using numpy's polyfit function"""
return np.polyfit(x, y, 1)[0]
@app.command()
def slopes(pwbase_group, min_nstep: Optional[int] = None, limit: int = None, method: str = "numpy"):
"""Sort the slopes for successful and failed calculations."""
query = orm.QueryBuilder()
query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append(
orm.WorkChainNode, with_group="group", tag="workchain"
).append(
PwCalculation,
tag="pw",
edge_filters={"label": "iteration_01"},
filters={"attributes.exit_status": {"in": [0, 410]}},
project='*'
)
if min_nstep is not None:
query.append(
orm.Dict,
with_incoming="pw",
edge_filters={"label": "output_parameters"},
filters={
"attributes.convergence_info.scf_conv.n_scf_steps": {">": min_nstep}
},
)
if limit is not None:
query.limit(limit)
number = query.count()
print(f"[bold blue]Report:[/] Number of completed pw.x: {number}")
slopes_ok = []
slopes_fail = []
errors = []
for [pw_calculation] in track(
query.iterall(), total=number, description="Checking slopes"
):
try:
y = pw_calculation.tools.get_scf_accuracy(0)
x = np.arange(len(y))
if method == "numpy":
slope = get_slope_numpy(x, np.log(y))
elif method == "scipy":
slope = get_slope_scipy(x, np.log(y))
if pw_calculation.exit_status == 0:
slopes_ok.append(slope)
else:
slopes_fail.append(slope)
except Exception as e:
errors.append((pw_calculation.pk, e))
plt.hist(slopes_ok, bins=120, label="Converged", alpha=0.6)
plt.hist(slopes_fail, bins=120, label="Failed", alpha=0.6)
plt.xlabel("Convergence slope")
plt.ylabel("Number of SCF")
plt.legend()
if min_nstep is not None:
filename = f"slopes_{min_nstep}.png"
else:
filename = "slopes.png"
plt.savefig(
filename, dpi=300, pad_inches=0.1, bbox_inches="tight"
)
@app.command()
def steps(pwbase_group, max_nstep: int = 50, limit: int = None):
"""How many steps are needed to converge?"""
query = orm.QueryBuilder()
query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append(
orm.WorkChainNode, with_group="group", tag="workchain"
).append(
PwCalculation,
tag="pw",
edge_filters={"label": "iteration_01"},
filters={"attributes.exit_status": 0},
project="attributes.exit_status",
).append(
orm.Dict,
with_incoming="pw",
edge_filters={"label": "output_parameters"},
project="attributes.convergence_info.scf_conv.n_scf_steps",
)
if limit is not None:
query.limit(limit)
number = query.count()
print(f"[bold blue]Report:[/] Number of completed pw.x: {number}")
n_scf_list = []
for exit_status, n_scf_steps in track(
query.iterall(), total=number, description="Gathering steps"
):
if exit_status == 0:
n_scf_list.append(n_scf_steps)
plt.hist(n_scf_list, bins=40)
plt.xlabel("Number of SCF Steps")
plt.ylabel("Frequency")
plt.axvline(x=max_nstep, color="red")
struct_perc = round(
sum([1 for i in n_scf_list if i <= max_nstep]) / number * 100, 1
)
plt.text(
max_nstep,
plt.ylim()[1] + 0.05 * (plt.ylim()[1] - plt.ylim()[0]),
f"{struct_perc}% of structures",
color="red",
verticalalignment="top",
horizontalalignment="right",
)
plt.savefig(
"./steps.png", dpi=300, pad_inches=0.1, bbox_inches="tight"
)
if __name__ == "__main__":
app() |
The current error handler for the `PwBaseWorkChain` that deals with SCF convergence issues only tries to reduce the `mixing-beta` by 20% each time the calculation fails, and doesn't consider the rate of convergence. Here we improve the error handling for this case by calculating the slope on the logarithm of the "scf accuracy". Based on some statistics presented in aiidateam#961 We noted two points: 1. Most of the calculations that converge do so within 50 SCF steps. 2. For those that don't, there is a very low chance of convergence in case the scf accuracy slope is higher than -0.1. Hence, we: * Limit the default number of maximum steps (`electron_maxstep`) to 50. * In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a full restart. * Else, check the `mixing_beta`. If above 0.1, half it and restart the calculation.
Thanks @mbercx, this is a great analysis! Indeed,
I think these two are the only resort in QE, a part from changing the |
Thanks @bastonero! Adding some extrapolation is not a bad idea, will do so in #987. Re the next strategy, currently we only adapt
To this I would say no, since the could substantially influence the results (and resource consumption ^^). Moreover, if someone is testing convergence etc one would definitely not want that. As a final note, I've recently added a bunch of failures to the following repository: https://github.com/mbercx/qe-issues The plan was to give these to Baroni & co and have them make suggestions.
|
This is at least what I think I would try to do. But proper testing on actual problematic cases is a more accurate solution for sure. Awesome re the issues! I saw there a lot of magnetic cases there. I feel those are the trickiest for sure, but don't have a good solution (probably having a physical intuition for a good starting magnetization can definitely help; nevertheless, sometimes larger steps are needed before the slope decreases sensibly - but sometime it's just luck). |
Indeed. I'm trying to gather more lower-dimensional cases in https://github.com/mbercx/qe-issues. I think there was also a difficult structure set somewhere we can look into, but I forget where. Will try and remember. ^^
So all those did succeed with the direct minimization implemented in |
Yeah, this direct minimization approach sounds sweet ! |
FYI from QE user guide:
|
Hello all! Thanks a lot for your effort in this, it can be very helpful. Together with the already mentioned solution of changing the mixing_mode for anisotropic structure, and ndim, in my experience also changing the diagonalization to 'cg' can help a lot. It would be helpful in my opinion to understand what are the problems in 'davidson' algorithm in order to figure out in which cases would be worth to change to (slower) 'cg' . |
The current error handler for the `PwBaseWorkChain` that deals with SCF convergence issues only tries to reduce the `mixing-beta` by 20% each time the calculation fails, and doesn't consider the rate of convergence. Here we improve the error handling for this case by calculating the slope on the logarithm of the "scf accuracy". Based on some statistics presented in aiidateam#961 We noted two points: 1. Most of the calculations that converge do so within 50 SCF steps. 2. For those that don't, there is a very low chance of convergence in case the scf accuracy slope is higher than -0.1. Hence, we: * Limit the default number of maximum steps (`electron_maxstep`) to 50. * In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a full restart. * Else, check the `mixing_beta`. If above 0.1, half it and restart the calculation.
When a calculation doesn't reach convergence within the number used in the workchain (80 steps) it does a similar restart.
Repeating the same parameters , this could be up to 4 times.
Perhaps at a second stage (restart) the workchain should just be kill to avoid repeating unsuccessful calculations.
On the other hand, should we consider if the system might need some extra steps? Should the restart try to include 50% more steps and if it doesn't success should it kill the job?
The text was updated successfully, but these errors were encountered: