fixed lots of parallel calculations #790

Merged
merged 1 commit into main from parallel-bz on Jun 18, 2024
Conversation

zerothi (Owner) commented Jun 18, 2024

A lot of refactoring of the parallel code enables huge speedups.

The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

SISL_PAR_CHUNKSIZE=25

it specifies the default chunksize for the pool.*() methods. Generally this gives good scaling, while fine-tuning can help further.

I can now see huge parallel performance benefits by leveraging the chunksize variable.

The dispatcher for the bz.apply method now allows a finer tuning of the pool creation.

  1. Single argument

    bz.apply(pool=2)

    Will create a pool with 2 processors

  2. Two tuple

    bz.apply(pool=(2, {"chunksize": 200}))

    will create a pool of 2 processors, and the chunksize will be 200 (regardless of SISL_PAR_CHUNKSIZE)

  3. Three tuple

    bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))

    will create a pool of 2 processors like so:

    pathos.pools.ProcessPool(nodes=2, args=1)
    

    check the documentation for pathos to see what can be done. Generally this need not be used.

These 3 variants are currently added, but I would like some input to see whether it makes sense, or whether we should change arguments etc.

The default number of processors is still 1, because OMP_NUM_THREADS can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES. So, to be on the safe side, we default it to 1.

The simplest way to control things is to do this in the code:

bz.apply(pool=True)...

then invoking the script with:

SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py

then the number of processes is determined from the environment variable.
In addition, all parallel invocations will now correctly update the progressbars from tqdm.
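
For illustration, a minimal sketch of this pattern; the file name, the k-grid and the .array.eigh() dispatch are placeholders chosen for the example, not prescribed by this PR:

import sisl

# read a Hamiltonian and define a Brillouin-zone sampling
H = sisl.get_sile("my_run.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [100, 1, 1])

# pool=True defers the number of processes and the chunksize to the
# SISL_NUM_PROCS / SISL_PAR_CHUNKSIZE environment variables
eigenvalues = bz.apply(pool=True).array.eigh()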

Since many routines might loop over distribution functions, we have changed them to enable broadcasting (internally). This required us to bump the required numpy version to >=1.20, released in January 2021.

Many dot calls have been changed to @; numpy recommends matmul when that is the intention. The primary reason is that scipy.sparse.dot(np.ndarray) can in certain cases result in an np.ndarray with dtype=object where the elements are themselves sparse matrices. So we can't use that.

There are still some corner cases where @ cannot be used. E.g. 1 @ array will fail since @ does not work on scalars. This is a bit unfortunate, as it would ease things a bit.
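
A small illustrative sketch of the point; the arrays and shapes are made up for the example:

import numpy as np
import scipy.sparse as sp

A = np.random.rand(4, 4)
S = sp.random(4, 4, density=0.5, format="csr")

C = A @ S      # matmul states the matrix-multiplication intent explicitly
# np.dot(A, S) can, in some cases, yield an object-dtype array whose elements
# are themselves sparse matrices, which is why dot is avoided here

# @ does not accept scalars, so scalar scaling still uses *:
# 1 @ A  -> raises an error
B = 2 * A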

Added more typing in state.py, electron.py and some minor other places.

  • Closes #x
  • Added tests for new/changed functions?
  • Ran isort . and black . [24.2.0] at top-level
  • Documentation for functionality in docs/
  • Changes documented in CHANGELOG.md

@@ -834,7 +838,7 @@
            return out, idx
        return out

-    def align_norm(self, other, ret_index=False, inplace=False):
+    def align_norm(self, other, ret_index: bool = False, inplace: bool = False):

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall through) returns
Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
zerothi (Owner, Author) commented Jun 18, 2024

@pfebrer @zerothi @nils-wittemeier

I think I have finally found the root cause of bad parallel performance. I can get massive speedups with very little effort.

Please try and read the above, and if you have some complex workflows, feel free to give them a try.
If this makes sense, and you can reproduce, I think we need to add some kind of section with parallel performance to teach users how to get massive speedups.

pfebrer (Contributor) commented Jun 18, 2024

I have checked that this works for pdos plots and bands plots (without any changes) 🎉

You can test it with a script like:

parallel_plots.py

import sisl

H = sisl.get_sile("my_run.fdf").read_hamiltonian()

H.plot.pdos(kgrid=[100, 1, 1]).show()

One process:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=1 time python parallel_plots.py
108.89user 0.46system 1:49.47elapsed 99%CPU

Two processes:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=1 time python parallel_plots.py
124.71user 1.29system 1:04.45elapsed 195%CPU

Nice!

pfebrer (Contributor) commented Jun 18, 2024

I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with SISL_NUM_PROCS and the size of the matrix 🤔

E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1]: if the user sets SISL_NUM_PROCS=2, they would expect some acceleration without needing to tweak anything else. But if chunksize is 25, they will get no acceleration.

I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.

pfebrer (Contributor) commented Jun 18, 2024

By the way, hybrid parallelization also works a little bit:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
113.37user 7.02system 0:46.55elapsed 258%CPU

I wonder if it would work better on a computer that is not my laptop 🤔

zerothi (Owner, Author) commented Jun 18, 2024

I guess chunksize can be smaller if the calculation for each k point is expensive, right? I wonder if chunksize's default could be set to change automatically with SISL_NUM_PROCS and the size of the matrix 🤔

E.g. imagine a huge matrix (diagonalizing it takes 1 minute) with a kgrid of [3, 3, 1]: if the user sets SISL_NUM_PROCS=2, they would expect some acceleration without needing to tweak anything else. But if chunksize is 25, they will get no acceleration.

I would say it is nice that the user can tweak chunksize, but the default could be dynamic so that non-experienced users don't need to worry about it.

Agreed, I thought about this initially; my first idea was that CHUNKSIZE could be a fractional number, in which case it would be some fraction of the iterations / NPROCS. But then the environ class would need to handle multiple value types; I guess this is ok...

I will play with that.

By the way, hybrid parallelization also works a little bit:

SISL_NUM_PROCS=2 OMP_NUM_THREADS=2 time python parallel_plots.py
113.37user 7.02system 0:46.55elapsed 258%CPU

I wonder if it would work better on a computer that is not my laptop 🤔

Yes, hybrid works just fine. For larger systems you'll see it even better. However, the fine-tuning of thread placement might not be optimal, and hence it takes some cycles before it finds the best spot...

I'll make parallel the default, if SISL_NUM_PROCS > 1.

pfebrer (Contributor) commented Jun 18, 2024

Although in my case 2 processes don't seem to be much better than two threads:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=2 time python parallel_plots.py
93.85user 6.01system 1:14.14elapsed 134%CPU

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k-points, no? Do you have an understanding of this?

zerothi (Owner, Author) commented Jun 18, 2024

Although in my case 2 processes don't seem to be much better than two threads:

SISL_NUM_PROCS=1 OMP_NUM_THREADS=2 time python parallel_plots.py
93.85user 6.01system 1:14.14elapsed 134%CPU

I guess it is most beneficial to go with processes instead of threads in the case of small matrices with a huge number of k-points, no? Do you have an understanding of this?

The threading is used for BLAS, so you'll only see this for matrices of some size, probably above 250 (the bigger the better).
You can try to tile so the matrix is ~1000; then you should be able to see it more clearly.
I don't know exactly how the thread placement influences things here; it might be that the threads are really battling because the processes are forked (not spawned), so you might get different results if you use something like this (a fuller sketch follows the snippet):

import multiprocessing as mp
mp.set_start_method("spawn")

<rest of script>
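
For reference, a hedged sketch of such a test; the fdf file, the tile factors and the k-grid are illustrative, and the __main__ guard is there because spawn re-imports the main module:

import multiprocessing as mp

import sisl

if __name__ == "__main__":
    # spawn instead of fork; this may change how the BLAS threads are placed
    mp.set_start_method("spawn")

    H = sisl.get_sile("my_run.fdf").read_hamiltonian()
    # tile the system so the matrix reaches ~1000 orbitals, where
    # threaded BLAS starts to pay off (factors chosen for illustration)
    H = H.tile(2, 0).tile(2, 1)

    H.plot.pdos(kgrid=[50, 1, 1]).show()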

pfebrer (Contributor) commented Jun 18, 2024

Agreed, I thought about this initially; my first idea was that CHUNKSIZE could be a fractional number, in which case it would be some fraction of the iterations / NPROCS.

I think it's fine that a default is computed if it is not set (e.g. like you are proposing), and that, if it's set explicitly, it is the actual size of the chunk, not a fraction.

I'll make parallel the default, if SISL_NUM_PROCS > 1.

Makes sense, I like the tweaking through env variables because you can reuse the same script with no modifications.

pfebrer (Contributor) commented Jun 18, 2024

Do you know if pathos is able to distribute work across computing nodes? 🤔 I.e. if the script has been launched with slurm.

If not, maybe there is another solution that can do that, which could be useful down the line :)

zerothi (Owner, Author) commented Jun 18, 2024

Agreed, I thought about this initially; my first idea was that CHUNKSIZE could be a fractional number, in which case it would be some fraction of the iterations / NPROCS.

I think it's fine that a default is computed if it is not set (e.g. like you are proposing), and that, if it's set explicitly, it is the actual size of the chunk, not a fraction.

I'll make parallel the default, if SISL_NUM_PROCS > 1.

Makes sense, I like the tweaking through env variables because you can reuse the same script with no modifications.

On the other hand, you might do a convergence test, in which case the fraction might be a good idea?
I'll add it; we can always adjust the default behaviour.
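
A minimal sketch of how such a fractional chunksize could be resolved; this is a hypothetical helper for illustration, not the actual sisl code:

def resolve_chunksize(chunksize, n_iter, n_procs):
    """Turn a possibly fractional chunksize into an integer chunk length.

    A fraction f means 1/f chunks per processor, e.g. 0.1 -> 10 chunks
    per processor over the n_iter items.
    """
    if 0 < chunksize < 1:
        return max(1, int(n_iter * chunksize / n_procs))
    return int(chunksize)

# e.g. 900 k-points, 2 processes, chunksize 0.1 -> chunks of 45 k-points
print(resolve_chunksize(0.1, 900, 2))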

zerothi (Owner, Author) commented Jun 18, 2024

Do you know if pathos is able to distribute work across computing nodes? 🤔 I.e. if the script has been launched with slurm.

If not, maybe there is another solution that can do that, which could be useful down the line :)

There is, but I don't want to complicate things here... ;)
It should be much simpler to do your own mpi4py wrapper which does this ;)

pfebrer (Contributor) commented Jun 18, 2024

There is, but I don't want to complicate things here... ;)
It should be much simpler to do your own mpi4py wrapper which does this ;)

Ok, maybe this could be a util in the toolbox. Something like:

mpirun -n 20 sisl_toolbox pdos/bands RUN.fdf

And that could generate a .PDOS or .bands file.
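
For reference, a minimal sketch (not part of sisl or the toolbox) of what such an mpi4py-based distribution of k-points could look like; the file name and the k-grid are placeholders:

from mpi4py import MPI

import numpy as np
import sisl

comm = MPI.COMM_WORLD

H = sisl.get_sile("RUN.fdf").read_hamiltonian()
bz = sisl.MonkhorstPack(H, [100, 1, 1])

# each MPI rank diagonalizes its own slice of the k-points
local_idx = np.array_split(np.arange(len(bz.k)), comm.size)[comm.rank]
local_eigs = [H.eigh(k=bz.k[i]) for i in local_idx]

# collect everything on rank 0, e.g. to write a .PDOS or .bands file
all_eigs = comm.gather(local_eigs, root=0)

This would then be launched as, e.g., mpirun -n 20 python script.py.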

ncpus = None
try:
    ncpus = pool.ncpus
except Exception:
    pass

Check notice (Code scanning / CodeQL): Empty except
'except' clause does nothing but pass and there is no explanatory comment.
if ncpus is None:
    try:
        ncpus = pool._processes
    except Exception:
        pass

Check notice (Code scanning / CodeQL): Empty except
'except' clause does nothing but pass and there is no explanatory comment.
@@ -21,14 +21,21 @@
except ImportError:
    _has_xarray = False

try:
    import pathos

Check notice (Code scanning / CodeQL): Unused import
Import of 'pathos' is not used.
Commit message (4f06f01):

A lot of refactoring of the parallel code
enables huge speedups.

The main culprit for bad scaling is the chunksize.

A new environment variable has been introduced:

   SISL_PAR_CHUNKSIZE=25 | 0.1

it specifies the default chunksize for the pool.*()
methods.
It defaults to a fractional number; a fraction means the inverse of the
number of chunks each CPU gets, i.e. 0.1 == 10 chunks per processor.
Generally this gives good scaling, while fine-tuning
can help further.

I can now see huge parallel performance benefits by leveraging
the chunksize variable.

The dispatcher for the `bz.apply` method now allows a finer
tuning of the pool creation.

1. Single argument

    bz.apply(pool=2)

   Will create a pool with 2 processors

2. Two tuple

    bz.apply(pool=(2, {"chunksize": 200}))

   will create a pool of 2 processors, and the chunksize
   will be 200 (regardless of SISL_PAR_CHUNKSIZE)

3. Three tuple

    bz.apply(pool=(2, {"args": 1}, {"chunksize": 200}))

   will create a pool of 2 processors like so:

       pathos.pools.ProcessPool(nodes=2, args=1)

   check the documentation for `pathos` to see what can be done.
   Generally this need not be used.

These 3 variants are currently added, but I would like some input
to see whether it makes sense, or whether we should change arguments etc.

The default number of processors is still 1, because OMP_NUM_THREADS
can easily create deadlocks if SISL_NUM_PROCS * OMP_NUM_THREADS > CORES.
So, to be on the safe side, we default it to 1.

Since parallel processing is now by default *on*, one should simply
need to do:

   SISL_NUM_PROCS=2 SISL_PAR_CHUNKSIZE=50 python3 script.py

then the number of processes is determined from the variable whenever a bz.apply is encountered.
In addition, all parallel invocations will now correctly
update the progressbars from tqdm.

Since many routines might loop over distribution functions, we have
changed them to enable broadcasting (internally).
This required us to bump the required numpy version to >=1.20.
This release is from January 2021.

Many dot calls have been changed to @; numpy recommends matmul when
that is the intention. The primary reason is that scipy.sparse.dot(np.ndarray)
can in certain cases result in an np.ndarray with dtype=object where the
elements are themselves sparse matrices. So we can't use that.

There are still some corner cases where @ cannot be used.
E.g. 1 @ array will fail since @ does not work on scalars. This is a bit
unfortunate as it would ease things a bit.

Added more typing in state.py, electron.py and some minor other places.

Signed-off-by: Nick Papior <[email protected]>
zerothi merged commit 4f06f01 into main on Jun 18, 2024
3 checks passed
zerothi deleted the parallel-bz branch on June 18, 2024 19:06
pfebrer (Contributor) commented Jun 18, 2024

Another thing that would speed up calculations significantly (more than this) would be to merge #496 🙄

It gives a 100x speed-up to the density computation; for me it is the difference between being able to use sisl to compute thousands of densities or not, and the change to the public API is extremely minimal.

rreho commented Jun 25, 2024

Hi,
Related to this conversation on parallelization, I have a few questions about the class sisl.viz.FatBandsPlot (I am using the latest version of sisl: the master branch).
I provide here a snippet of the Python script:

fatbands = band_struct.plot.fatbands(
    bands_style={'color': 'grey', 'opacity': 0.2, 'width': 2},
    Erange=[Emin, Emax],
    fatbands_scale=0.2,
    fatbands_mode='area_line',
    bands_mode='scatter',
)
fatbands.update_inputs(
    groups=[
        {"atoms": [0, 1], "color": "blue", "name": "1"},
        {"atoms": [28, 29], "color": "red", "name": "15"},
    ],
)

where band_struct is a sisl.BandStructure object.

Questions

  1. Considering the high number of k-points I need for this study, is it possible to improve/enable the parallelization? If so, how? [I have 936 orbitals and it takes several hours]
  2. Even with only a few k-points the method seems rather slow, especially compared to similar routines that do not exploit the sisl.viz.FatBands methods. Is it possible to speed up the part related to the computation of the projections?
  3. I did some tests and band_struct.plot.fatbands works only with fatbands_mode="area_line". Ideally, I would like to use scatter, access the weights' data (which might be stored in some array) and customize my plots in the notebook. One of the reasons for personal customization is that I found out that backend="matplotlib" might not always work.

Thank you for your help.

zerothi (Owner, Author) commented Jun 25, 2024

Let me note here that it depends on your OS etc.

If you are using Linux and you have a script, then you should do something like this:

export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
python3 script.py

and it should do it automatically.
If you are doing this with a jupyter notebook, then I think you should do this:

export OMP_NUM_THREADS=1
export SISL_NUM_PROCS=4
jupyter notebook notebook.ipynb

Note that 4 should be changed to whatever you want.
As for question 3, you should probably open a new issue for that one.
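
If exporting the variables before launching Jupyter is inconvenient, setting them from inside the notebook may also work; this is a hedged sketch that assumes sisl reads the variables lazily, when the pool is created:

import os

# run this cell before importing sisl/numpy, since OMP_NUM_THREADS is
# typically read when the BLAS/OpenMP libraries are loaded
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["SISL_NUM_PROCS"] = "4"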

rreho commented Jun 25, 2024

Thanks, so those are only small changes in the submission script and/or notebook. Should I also set pool=True somewhere in my script?

zerothi (Owner, Author) commented Jun 25, 2024

Thanks, so those are only small changes in the submission script and/or notebook. Should I also set pool=True somewhere in my script?

In main this is the (new) default due to this PR, so you shouldn't need it.
I would suggest you play with this locally first and see how it performs. :)

pfebrer (Contributor) commented Jun 25, 2024

Is it possible to speed up the part related to the computation of the projections?

The fatbands plot just computes the eigenstates in the way that sisl allows, so there would be no speed benefit from doing it yourself "manually" within sisl.

I did some tests and band_struct.plot.fatbands works only with fatbands_mode="area_line". Ideally, I would like to use scatter, access the weights' data (which might be stored in some array) and customize my plots in the notebook. One of the reasons for personal customization is that I found out that backend="matplotlib" might not always work.

You can always compute the fatbands yourself using sisl's methods, or take the data from the plot like:

fatbands.nodes["bands_data"].get()

Then you can do whatever you want with the data. But if you have seen issues with the plotting, it would be great if you could create a minimal example of the error and open an issue :) Then we can all benefit from the fixes!

rreho commented Jun 25, 2024

Thanks for your comments. I created a new issue with a minimal example.
