fixed lots of parallel calculations #790
Conversation
```diff
@@ -834,7 +838,7 @@
             return out, idx
         return out

-    def align_norm(self, other, ret_index=False, inplace=False):
+    def align_norm(self, other, ret_index: bool = False, inplace: bool = False):
```

Code scanning / CodeQL notice: Explicit returns mixed with implicit (fall-through) returns.
@pfebrer @zerothi @nils-wittemeier I think I have finally found the root cause of bad parallel performance. I can get massive speedups with very little effort. Please try and read the above, and if you have some complex workflows, feel free to give them a try.
I have checked that this works for pdos plots and bands plots (without any changes) 🎉 You can test it with a script like:

`parallel_plots.py`:

```python
import sisl

H = sisl.get_sile("my_run.fdf").read_hamiltonian()
H.plot.pdos(kgrid=[100, 1, 1]).show()
```

One process:

Two processes:
Nice!
I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with the number of k points and processes. E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1]: if the user sets a chunksize larger than the number of k points, all of them end up in a single chunk and only one process does any work. I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.
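A minimal sketch of what such a dynamic default could look like; the helper name and the heuristic of roughly 10 chunks per process are illustrative assumptions, not sisl's actual implementation:

```python
import os

def default_chunksize(n_k: int, n_procs: int, chunks_per_proc: int = 10) -> int:
    """Hypothetical heuristic: aim for ~chunks_per_proc chunks per process,
    but never let a single chunk swallow all k points."""
    env = os.environ.get("SISL_PAR_CHUNKSIZE")
    if env is not None:
        # an explicit setting wins (interpreted here as an absolute chunk size)
        return int(env)
    return max(1, n_k // (n_procs * chunks_per_proc))

# 9 k points ([3, 3, 1]) on 2 processes -> chunksize 1, so both processes get work
print(default_chunksize(9, 2))
```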
By the way, hybrid parallelization also works a little bit:

```sh
SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
```

I wonder if it would work better on a computer that is not my laptop 🤔
Agreed, I thought about this initially, my first idea was that … I will play with that.

Yes, hybrid works just fine. For larger systems you'll see it even better. However, there is some fine-tuning of thread placement that might not be optimal, and hence it takes some cycles before it finds the best spot... I'll make parallel the default, if …
Although in my case 2 processes don't seem to be much better than two threads:

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k points, no? Do you have an understanding of this?
The threading is used for BLAS, so you'll only see this for matrices of some size. Probably above 250 (the bigger the better).

```python
import multiprocessing as mp
mp.set_start_method("spawn")
<rest of script>
```
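When the spawn start method is used, the script body generally has to be guarded so the freshly spawned interpreters do not re-execute it on import. A minimal sketch of that pattern, reusing the file name and k grid from the earlier test script and assuming the pool honours the multiprocessing start method:

```python
import multiprocessing as mp
import sisl

def main():
    # hypothetical input file, same as in the earlier test script
    H = sisl.get_sile("my_run.fdf").read_hamiltonian()
    H.plot.pdos(kgrid=[100, 1, 1]).show()

if __name__ == "__main__":
    # "spawn" starts fresh interpreters, so everything below must be import-safe
    mp.set_start_method("spawn")
    main()
```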
I think it's fine that if it is not set, a default is computed (e.g. like you are proposing), and if it's set explicitly it is the actual size of the chunk, not a fraction.

Makes sense. I like the tweaking through env variables because you can reuse the same script with no modifications.
Do you know if pathos is able to distribute work across computing nodes? 🤔 I.e. if the script has been launched with slurm. If not, maybe there is another solution that can do that, which could be useful down the line :)

On the other hand, you might do a convergence test, in which case the fraction might be a good idea?

There is, but I don't want to complicate things here... ;)
Ok, maybe this could be a util in the toolbox. Something like:

```sh
mpirun -n 20 sisl_toolbox pdos/bands RUN.fdf
```

And that could generate a …
```python
ncpus = None
try:
    ncpus = pool.ncpus
except Exception:
```

Code scanning / CodeQL notice: Empty except.
```python
if ncpus is None:
    try:
        ncpus = pool._processes
    except Exception:
```

Code scanning / CodeQL notice: Empty except.
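A sketch of how these two fall-back look-ups could be written without the empty `except` blocks that CodeQL flags; the helper name is made up and this is not the PR's actual code:

```python
def _pool_ncpus(pool, default: int = 1) -> int:
    # pathos pools expose `ncpus`, stdlib multiprocessing pools expose `_processes`
    for attr in ("ncpus", "_processes"):
        n = getattr(pool, attr, None)
        if n is not None:
            return n
    return default
```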
```diff
@@ -21,14 +21,21 @@
 except ImportError:
     _has_xarray = False

+try:
+    import pathos
```

Code scanning / CodeQL notice: Unused import.
A lot of refactoring of the parallel codes enables huge speedups. The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

```sh
SISL_PAR_CHUNKSIZE=25 | 0.1
```

It specifies the default chunksize for the `pool.*()` methods. It defaults to a fractional number, in which case its inverse is the number of chunks each CPU gets, i.e. 0.1 == 10 chunks per processor. Generally this shows good scaling, while fine-tuning can help. I can now see huge parallel performance benefits by leveraging the chunksize variable.

The dispatcher for the `bz.apply` method now allows a finer tuning of the pool creation.

1. Single argument: `bz.apply(pool=2)` will create a pool with 2 processors.
2. Two-tuple: `bz.apply(pool=(2, {"chunksize": 200}))` will create a pool of 2 processors, and the chunksize will be 200 (regardless of SISL_PAR_CHUNKSIZE).
3. Three-tuple: `bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))` will create a pool of 2 processors like so: `pathos.pools.ProcessPool(nodes=2, args=1)`. Check the documentation for `pathos` to see what can be done. Generally this need not be used.

These 3 variants are currently added, but I would like some input to see whether it makes sense, or whether we should change the arguments etc.

The default number of processors is still 1; this is because OMP_NUM_THREADS can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES. So to be on the safe side, we default it to 1. Since parallel processing is now by default *on*, one should simply need to do:

```sh
SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py
```

then the procs are determined from the variable whenever a `bz.apply` is found.

In addition, all parallel invocations will now correctly update the progress bars from tqdm.

Since many routines might loop over distribution functions, we have changed them to enable broadcasting (internally). This required us to bump the required numpy version to >=1.20; that release is from January 2021.

Many dot calls have been changed to `@`; numpy recommends matmul when that is the intention. The primary reason is that `scipy.sparse.dot(np.ndarray)` can in certain cases result in an `np.ndarray` with dtype=object where the elements are actually sparse matrices. So we can't use that. There are still some corner cases where `@` cannot be used, e.g. `1 @ array` will fail, it does not work on scalars. This is a bit unfortunate as it would ease things a bit.

Added more typing in state.py, electron.py and some minor other places.

Signed-off-by: Nick Papior <[email protected]>
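A minimal usage sketch of the pool argument; the fdf file name, the Monkhorst-Pack grid, and the exact `.array.eigh()` chaining are assumptions for illustration, while the pool forms themselves are the ones listed above:

```python
import sisl

# hypothetical input file and k-point sampling
H = sisl.get_sile("RUN.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [10, 10, 10])

# 1. single argument: a pool of 2 processes, chunksize taken from SISL_PAR_CHUNKSIZE
eigs = bz.apply(pool=2).array.eigh()

# 2. two-tuple: 2 processes with an explicit chunksize of 200
eigs = bz.apply(pool=(2, {"chunksize": 200})).array.eigh()

# or let the environment decide everything (SISL_NUM_PROCS / SISL_PAR_CHUNKSIZE)
eigs = bz.apply(pool=True).array.eigh()
```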
Another thing that would speed up calculations significantly (more than this) would be to merge #496 🙄 It gives a 100x speed-up of the density computation; for me it is the difference between being able to use sisl to compute thousands of densities or not, and the change in the public API is extremely minimal.
Hi,

… where band_struct is a `sisl.BandStructure` object.

Questions: …

Thank you for your help.
Let me note here that it depends on your OS etc. If you are using Linux and you have a script, then you should do something like this:

```sh
export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
python3 script.py
```

and it should do it automatically. The same goes for a notebook:

```sh
export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
jupyter notebook notebook.ipynb
```

Note …
Thanks, so those are only small changes in the submission script and/or notebook. Should I also set … in …?
The fatbands plot just computes the eigenstates in the way that sisl allows, so there would not be speed benefits from doing it yourself "manually" within sisl.

You can always compute the fatbands yourself using sisl's methods, or take the data from the plot like:

```python
fatbands.nodes["bands_data"].get()
```

Then you can do whatever you want with the data. But if you have seen issues with the plotting, it would be great if you could create a minimal example of the error and open an issue :) Then we can all benefit from the fixes!
Thanks for your comments. I created a new issue with a minimal example.
A lot of refactoring of the parallel codes enables huge speedups. The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

```sh
SISL_PAR_CHUNKSIZE=25
```

It specifies the default chunksize for the `pool.*()` methods. Generally this shows perfect scaling, while fine-tuning can help. I can now see huge parallel performance benefits by leveraging the chunksize variable.

The dispatcher for the `bz.apply` method now allows a finer tuning of the pool creation.

1. Single argument: `bz.apply(pool=2)` will create a pool with 2 processors.
2. Two-tuple: `bz.apply(pool=(2, {"chunksize": 200}))` will create a pool of 2 processors, and the chunksize will be 200 (regardless of SISL_PAR_CHUNKSIZE).
3. Three-tuple: `bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))` will create a pool of 2 processors like so: `pathos.pools.ProcessPool(nodes=2, args=1)`. Check the documentation for `pathos` to see what can be done. Generally this need not be used.

These 3 variants are currently added, but I would like some input to see whether it makes sense, or whether we should change the arguments etc.

The default number of processors is still 1; this is because OMP_NUM_THREADS can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES. So to be on the safe side, we default it to 1.

The simplest way to control things is to do this in the code:

```python
bz.apply(pool=True)...
```

and invoke the script with:

```sh
SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py
```

The procs are then determined from the variable.
In addition, all parallel invocations will now correctly update the progress bars from tqdm.

Since many routines might loop over distribution functions, we have changed them to enable broadcasting (internally). This required us to bump the required numpy version to >=1.20; that release is from January 2021.

Many dot calls have been changed to `@`; numpy recommends matmul when that is the intention. The primary reason is that `scipy.sparse.dot(np.ndarray)` can in certain cases result in an `np.ndarray` with dtype=object where the elements are actually sparse matrices. So we can't use that.

There are still some corner cases where `@` cannot be used. E.g. `1 @ array` will fail, it does not work on scalars. This is a bit unfortunate as it would ease things a bit.
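A small illustration of that scalar corner case (plain numpy, just to show why the scalar product still needs `*`):

```python
import numpy as np

a = np.arange(3.0)
M = np.eye(3)

M @ a      # matrix-vector product, the recommended replacement for M.dot(a)
a * 2.0    # scalar multiplication still needs *, since ...
# 2.0 @ a  # ... this fails: @ does not operate on scalars
```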
Added more typing in state.py, electron.py and some minor other places.
- Ran `isort .` and `black .` [24.2.0] at top-level
- docs/
- CHANGELOG.md