fixed lots of parallel calculations #790

Merged
merged 1 commit into main from parallel-bz on Jun 18, 2024
Conversation

zerothi (Owner) commented Jun 18, 2024

A lot of refactoring of the parallel code enables huge speedups.

The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

SISL_PAR_CHUNKSIZE=25

it specifies the default chunksize for the pool.*() methods. Generally this gives good scaling, while fine-tuning can help further.

I can now see huge parallel performance benefits by leveraging the chunksize variable.

The dispatcher for the bz.apply method now allows a finer tuning of the pool creation.

  1. Single argument

    bz.apply(pool=2)

    Will create a pool with 2 processors

  2. Two tuple

    bz.apply(pool=(2, {"chunksize": 200}))

    will create a pool of 2 processors, and the chunksize will be 200 (regardless of SISL_PAR_CHUNKSIZE)

  3. Three tuple

    bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))

    will create a pool of 2 processors like so:

    pathos.pools.ProcessPool(nodes=2, args=1)
    

    check the documentation for pathos to see what can be done. Generally this need not be used.

These 3 variants are currently added, but I would like some input to see whether it makes sense, or whether we should change arguments etc.

The default number of processors is still 1, because OMP_NUM_THREADS can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES. So, to be on the safe side, we default it to 1.

The simplest way to control things is to do this in the code:

bz.apply(pool=True)...

then invoking the script with:

SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py

then the number of processes is determined from the environment variable.
In addition, all parallel invocations will now correctly update the progressbars from tqdm.
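
For illustration, a minimal sketch of this pattern; the file name, the k-grid and the .array.eigh() dispatch are placeholders chosen for the example, not prescribed by this PR:

import sisl

# read a Hamiltonian and define a Brillouin-zone sampling
H = sisl.get_sile("my_run.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [100, 1, 1])

# pool=True defers the number of processes and the chunksize to the
# SISL_NUM_PROCS / SISL_PAR_CHUNKSIZE environment variables
eigenvalues = bz.apply(pool=True).array.eigh()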

Since many routines might loop over distribution functions, we have changed them to enable broadcasting (internally). This required us to bump the required numpy version to >=1.20, released in January 2021.

Many dot calls have been changed to @; numpy recommends matmul when that is the intention. The primary reason is that scipy.sparse.dot(np.ndarray) can in certain cases result in an np.ndarray with dtype=object where the elements are themselves sparse matrices. So we can't use that.

There are still some corner cases where @ cannot be used. E.g. 1 @ array will fail since @ does not work on scalars. This is a bit unfortunate, as it would ease things a bit.
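
A small illustrative sketch of the point; the arrays and shapes are made up for the example:

import numpy as np
import scipy.sparse as sp

A = np.random.rand(4, 4)
S = sp.random(4, 4, density=0.5, format="csr")

C = A @ S      # matmul states the matrix-multiplication intent explicitly
# np.dot(A, S) can, in some cases, yield an object-dtype array whose elements
# are themselves sparse matrices, which is why dot is avoided here

# @ does not accept scalars, so scalar scaling still uses *:
# 1 @ A  -> raises an error
B = 2 * A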

Added more typing in state.py, electron.py and some minor other places.

  • Closes #x
  • Added tests for new/changed functions?
  • Ran isort . and black . [24.2.0] at top-level
  • Documentation for functionality in docs/
  • Changes documented in CHANGELOG.md

@@ -834,7 +838,7 @@
            return out, idx
        return out

-    def align_norm(self, other, ret_index=False, inplace=False):
+    def align_norm(self, other, ret_index: bool = False, inplace: bool = False):

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall through) returns
Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
zerothi (Owner, Author) commented Jun 18, 2024

@pfebrer @zerothi @nils-wittemeier

I think I have finally found the root cause of bad parallel performance. I can get massive speedups with very little effort.

Please try and read the above, and if you have some complex workflows, feel free to give them a try.
If this makes sense, and you can reproduce, I think we need to add some kind of section with parallel performance to teach users how to get massive speedups.

pfebrer (Contributor) commented Jun 18, 2024

I have checked that this works for pdos plots and bands plots (without any changes) 🎉

You can test it with a script like:

parallel_plots.py

import sisl

H = sisl.get_sile("my_run.fdf").read_hamiltonian()

H.plot.pdos(kgrid=[100, 1, 1]).show()

One process:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=1 time python parallel_plots.py
108.89user 0.46system 1:49.47elapsed 99%CPU

Two processes:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=1 time python parallel_plots.py
124.71user 1.29system 1:04.45elapsed 195%CPU

Nice!

pfebrer (Contributor) commented Jun 18, 2024

I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with SISL_NUM_PROCS and the size of the matrix 🤔

E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1]: if the user sets SISL_NUM_PROCS=2, they would expect some acceleration without needing to tweak anything else. But if chunksize is 25, they will get no acceleration.

I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.

pfebrer (Contributor) commented Jun 18, 2024

By the way, hybrid parallelization also works a little bit:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
113.37user 7.02system 0:46.55elapsed 258%CPU

I wonder if it would work better on a computer that is not my laptop 🤔

zerothi (Owner, Author) commented Jun 18, 2024

I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with SISL_NUM_PROCS and the size of the matrix 🤔

E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1]: if the user sets SISL_NUM_PROCS=2, they would expect some acceleration without needing to tweak anything else. But if chunksize is 25, they will get no acceleration.

I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.

Agreed, I thought about this initially; my first idea was that CHUNKSIZE could be a fractional number, in which case it would be some fraction of the iterations / NPROCS. But then the environ class would need to handle multiple value types; I guess this is ok...

I will play with that.

By the way, hybrid parallelization also works a little bit:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
113.37user 7.02system 0:46.55elapsed 258%CPU

I wonder if it would work better on a computer that is not my laptop 🤔

Yes, hybrid works just fine. For larger systems you'll see it even better. However, the fine-tuning of thread placement might not be optimal, and hence it takes some cycles before it finds the best spot...

I'll make parallel the default, if SISL_NUM_PROCS > 1.

pfebrer (Contributor) commented Jun 18, 2024

Although in my case 2 processes don't seem to be much better than two threads:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=2 time python parallel_plots.py
93.85user 6.01system 1:14.14elapsed 134%CPU

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k-points, no? Do you have an understanding of this?

zerothi (Owner, Author) commented Jun 18, 2024

Although in my case 2 processes don't seem to be much better than two threads:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=2 time python parallel_plots.py
93.85user 6.01system 1:14.14elapsed 134%CPU

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k-points, no? Do you have an understanding of this?

The threading is used for BLAS, so you'll only see this for matrices of some size, probably above 250 (the bigger the better).
You can try to tile so the matrix is ~1000; then you should be able to see it more clearly.
I don't know exactly how the thread placement influences things here; it might be that the threads are really battling because the processes are forked (not spawned), so you might get different results if you use something like this (a fuller sketch follows the snippet):

import multiprocessing as mp
mp.set_start_method("spawn")

<rest of script>
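
For reference, a hedged sketch of such a test; the fdf file, the tile factors and the k-grid are illustrative, and the __main__ guard is there because spawn re-imports the main module:

import multiprocessing as mp

import sisl

if __name__ == "__main__":
    # spawn instead of fork; this may change how the BLAS threads are placed
    mp.set_start_method("spawn")

    H = sisl.get_sile("my_run.fdf").read_hamiltonian()
    # tile the system so the matrix reaches ~1000 orbitals, where
    # threaded BLAS starts to pay off (factors chosen for illustration)
    H = H.tile(2, 0).tile(2, 1)

    H.plot.pdos(kgrid=[50, 1, 1]).show()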

pfebrer (Contributor) commented Jun 18, 2024

Agreed, I thought about this initially; my first idea was that CHUNKSIZE could be a fractional number, in which case it would be some fraction of the iterations / NPROCS.

I think it's fine that a default is computed if it is not set (e.g. like you are proposing), and that, if it's set explicitly, it is the actual size of the chunk, not a fraction.

I'll make parallel the default, if SISL_NUM_PROCS > 1.

Makes sense, I like the tweaking through env variables because you can reuse the same script with no modifications.

pfebrer (Contributor) commented Jun 18, 2024

Do you know if pathos is able to distribute work across computing nodes? 🤔 I.e. if the script has been launched with slurm.

If not, maybe there is another solution that can do that, which could be useful down the line :)

zerothi (Owner, Author) commented Jun 18, 2024

Agreed, I thought about this initially; my first idea was that CHUNKSIZE could be a fractional number, in which case it would be some fraction of the iterations / NPROCS.

I think it's fine that a default is computed if it is not set (e.g. like you are proposing), and that, if it's set explicitly, it is the actual size of the chunk, not a fraction.

I'll make parallel the default, if SISL_NUM_PROCS > 1.

Makes sense, I like the tweaking through env variables because you can reuse the same script with no modifications.

On the other hand, you might do a convergence test, in which case the fraction might be a good idea?
I'll add it; we can always adjust the default behaviour.
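
A minimal sketch of how such a fractional chunksize could be resolved; this is a hypothetical helper for illustration, not the actual sisl code:

def resolve_chunksize(chunksize, n_iter, n_procs):
    """Turn a possibly fractional chunksize into an integer chunk length.

    A fraction f means 1/f chunks per processor, e.g. 0.1 -> 10 chunks
    per processor over the n_iter items.
    """
    if 0 < chunksize < 1:
        return max(1, int(n_iter * chunksize / n_procs))
    return int(chunksize)

# e.g. 900 k-points, 2 processes, chunksize 0.1 -> chunks of 45 k-points
print(resolve_chunksize(0.1, 900, 2))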

zerothi (Owner, Author) commented Jun 18, 2024

Do you know if pathos is able to distribute work across computing nodes? 🤔 I.e. if the script has been launched with slurm.

If not, maybe there is another solution that can do that, which could be useful down the line :)

There is, but I don't want to complicate things here... ;)
It should be much simpler to do your own mpi4py wrapper which does this ;)

pfebrer (Contributor) commented Jun 18, 2024

There is, but I don't want to complicate things here... ;)
It should be much simpler to do your own mpi4py wrapper which does this ;)

Ok, maybe this could be a util in the toolbox. Something like:

mpirun -n 20 sisl_toolbox pdos/bands RUN.fdf

And that could generate a .PDOS or .bands file.
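
For reference, a minimal sketch (not part of sisl or the toolbox) of what such an mpi4py-based distribution of k-points could look like; the file name and the k-grid are placeholders:

from mpi4py import MPI

import numpy as np
import sisl

comm = MPI.COMM_WORLD

H = sisl.get_sile("RUN.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [100, 1, 1])

# each MPI rank diagonalizes its own slice of the k-points
local_idx = np.array_split(np.arange(len(bz.k)), comm.size)[comm.rank]
local_eigs = [H.eigh(k=bz.k[i]) for i in local_idx]

# collect everything on rank 0, e.g. to write a .PDOS or .bands file
all_eigs = comm.gather(local_eigs, root=0)

This would then be launched as, e.g., mpirun -n 20 python script.py.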

ncpus = None
try:
    ncpus = pool.ncpus
except Exception:
    pass

Check notice (Code scanning / CodeQL): Empty except
'except' clause does nothing but pass and there is no explanatory comment.
if ncpus is None:
    try:
        ncpus = pool._processes
    except Exception:
        pass

Check notice (Code scanning / CodeQL): Empty except
'except' clause does nothing but pass and there is no explanatory comment.
@@ -21,14 +21,21 @@
except ImportError:
    _has_xarray = False

try:
    import pathos

Check notice (Code scanning / CodeQL): Unused import
Import of 'pathos' is not used.
Commit message (4f06f01):

A lot of refactoring of the parallel code
enables huge speedups.

The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

   SISL_PAR_CHUNKSIZE=25 | 0.1

it specifies the default chunksize for the pool.*()
methods.
It defaults to a fractional number; a fraction means the inverse of the
number of chunks each CPU gets, i.e. 0.1 == 10 chunks per processor.
Generally this gives good scaling, while fine-tuning
can help further.

I can now see huge parallel performance benefits by leveraging
the chunksize variable.

The dispatcher for the `bz.apply` method now allows a finer
tuning of the pool creation.

1. Single argument

    bz.apply(pool=2)

   Will create a pool with 2 processors

2. Two tuple

    bz.apply(pool=(2, {"chunksize": 200}))

   will create a pool of 2 processors, and the chunksize
   will be 200 (regardless of SISL_PAR_CHUNKSIZE)

3. Three tuple

    bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))

   will create a pool of 2 processors like so:

       pathos.pools.ProcessPool(nodes=2, args=1)

   check the documentation for `pathos` to see what can be done.
   Generally this need not be used.

These 3 variants are currently added, but I would like some input
to see whether it makes sense, or whether we should change arguments etc.

The default number of processors is still 1, because OMP_NUM_THREADS
can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES.
So, to be on the safe side, we default it to 1.

Since parallel processing is now by default *on*, one should simply
need to do:

   SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py

then the number of processes is determined from the variable whenever a bz.apply is encountered.
In addition, all parallel invocations will now correctly
update the progressbars from tqdm.

Since many routines might loop over distribution functions, we have
changed them to enable broadcasting (internally).
This required us to bump the required numpy version to >=1.20.
This release is from January 2021.

Many dot calls have been changed to @; numpy recommends matmul when
that is the intention. The primary reason is that scipy.sparse.dot(np.ndarray)
can in certain cases result in an np.ndarray with dtype=object where the
elements are themselves sparse matrices. So we can't use that.

There are still some corner cases where @ cannot be used.
E.g. 1 @ array will fail since @ does not work on scalars. This is a bit
unfortunate as it would ease things a bit.

Added more typing in state.py, electron.py and some minor other places.

Signed-off-by: Nick Papior <[email protected]>
zerothi merged commit 4f06f01 into main on Jun 18, 2024
3 checks passed
zerothi deleted the parallel-bz branch on June 18, 2024 19:06
pfebrer (Contributor) commented Jun 18, 2024

Another thing that would speed up calculations significantly (more than this) would be to merge #496 🙄

It gives a 100x speed-up to the density computation; for me it is the difference between being able to use sisl to compute thousands of densities or not, and the change to the public API is extremely minimal.

rreho commented Jun 25, 2024

Hi,
Related to this conversation on parallelization, I have a few questions about the class sisl.viz.FatBandsPlot (I am using the latest version of sisl: the master branch).
I provide here a snippet of the Python script:

fatbands = band_struct.plot.fatbands(
    bands_style={'color': 'grey', 'opacity': 0.2, 'width': 2},
    Erange=[Emin, Emax],
    fatbands_scale=0.2,
    fatbands_mode='area_line',
    bands_mode='scatter',
)
fatbands.update_inputs(
    groups=[
        {"atoms": [0, 1], "color": "blue", "name": "1"},
        {"atoms": [28, 29], "color": "red", "name": "15"},
    ],
)

where band_struct is a sisl.BandStructure object.

Questions

  1. Considering the high number of k-points I need for this study, is it possible to improve/enable the parallelization? If so, how? [I have 936 orbitals and it takes several hours]
  2. Even with only a few k-points the method seems rather slow, especially compared to similar routines that do not exploit the sisl.viz.FatBands methods. Is it possible to speed up the part related to the computation of the projections?
  3. I did some tests and band_struct.plot.fatbands works only with fatbands_mode="area_line". Ideally, I would like to use scatter, access the weights' data (which might be stored in some array) and customize my plots in the notebook. One of the reasons for personal customization is that I found out that backend="matplotlib" might not always work.

Thank you for your help.

zerothi (Owner, Author) commented Jun 25, 2024

Let me note here that it depends on your OS etc.

If you are using Linux and you have a script, then you should do something like this:

export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
python3 script.py

and it should do it automatically.
If you are doing this with a jupyter notebook, then I think you should do this:

export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
jupyter notebook notebook.ipynb

Note that 4 should be changed to whatever you want.
As for question 3, you should probably open a new issue for that one.
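
If exporting the variables before launching Jupyter is inconvenient, setting them from inside the notebook may also work; this is a hedged sketch that assumes sisl reads the variables lazily, when the pool is created:

import os

# run this cell before importing sisl/numpy, since OMP_NUM_THREADS is
# typically read when the BLAS/OpenMP libraries are loaded
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["SISL_NUM_PROCS"] = "4"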

rreho commented Jun 25, 2024

Thanks, so those are only small changes in the submission script and/or notebook. Should I also set pool=True somewhere in my script?

zerothi (Owner, Author) commented Jun 25, 2024

Thanks, so those are only small changes in the submission script and/or notebook. Should I also set pool=True somewhere in my script?

In main this is the (new) default due to this PR, so you shouldn't need it.
I would suggest you play with this locally first and see how it performs. :)

pfebrer (Contributor) commented Jun 25, 2024

Is it possible to speed up the part related to the computation of the projections?

The fatbands plot just computes the eigenstates in the way that sisl allows, so there would be no speed benefit from doing it yourself "manually" within sisl.

I did some tests and band_struct.plot.fatbands works only with fatbands_mode="area_line". Ideally, I would like to use scatter, access the weights' data (which might be stored in some array) and customize my plots in the notebook. One of the reasons for personal customization is that I found out that backend="matplotlib" might not always work.

You can always compute the fatbands yourself using sisl's methods, or take the data from the plot like:

fatbands.nodes["bands_data"].get()

Then you can do whatever you want with the data. But if you have seen issues with the plotting, it would be great if you could create a minimal example of the error and open an issue :) Then we can all benefit from the fixes!

rreho commented Jun 25, 2024

Thanks for your comments. I created a new issue with a minimal example.
