
feat: Added the option to run inference in parallel #108

Merged 30 commits into develop from feature/model-parallel on Jan 22, 2025

Conversation

@cathalobrien (Contributor) commented Jan 16, 2025

This PR allows you to run inference across multiple GPUs and nodes.

Parallel inference relies on PR #77 to anemoi-models. Running sequentially with an older version of anemoi-models will still work; trying to run in parallel with an older version will prompt you to upgrade. Once a release of anemoi-models includes PR #77, it might be worthwhile bumping the minimum version required by anemoi-inference.

Currently Slurm is required to launch the parallel processes, and some Slurm environment variables are read to set up the networking. Below is an example Slurm batch script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=0:05:00
#SBATCH --output=outputs/parallel_inf.%j.out

source env.sh
srun anemoi-inference run inf.yaml
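
Roughly, the setup reads those Slurm variables and initialises a torch.distributed process group. This is only a minimal sketch of the idea, not the exact code in this PR; the port and backend choice here are illustrative:

import datetime
import os
import subprocess

import torch.distributed as dist

# Rank and world size come from Slurm's per-task environment.
global_rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# Use the first node of the job as the rendezvous host.
# "scontrol show hostnames" expands the compact node list, e.g. "node[01-02]".
master_addr = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.splitlines()[0]

dist.init_process_group(
    backend="nccl",  # GPU inference; "gloo" would be the CPU choice
    init_method=f"tcp://{master_addr}:10000",  # illustrative port
    world_size=world_size,
    rank=global_rank,
    timeout=datetime.timedelta(minutes=3),
)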

One quality-of-life feature would be to make only process 0 log. Currently you get a lot of spam when running in parallel because every process logs. Any ideas on how to do this nicely would be welcome:

2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
...
2025-01-16 08:23:26 INFO World size: 4

📚 Documentation preview 📚: https://anemoi-inference--108.org.readthedocs.build/en/108/

@codecov-commenter commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.03%. Comparing base (e9dbd48) to head (ecc8fa0).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #108   +/-   ##
========================================
  Coverage    98.03%   98.03%           
========================================
  Files            3        3           
  Lines           51       51           
========================================
  Hits            50       50           
  Misses           1        1           

☔ View full report in Codecov by Sentry.

@gmertes (Member) left a comment


I'm a bit concerned about adding the parallel stuff to the default Runner. I would rather create a separate ParallelRunner for it.

In develop we now have the option to instantiate runners from the config, so you would explicitly select it when you want to do parallel inference.

Opinions on factoring out this code into a new runner?

  • Add a Runner.log() that wraps the logger.
  • Factor out the output block into a Runner.output().
  • Then, in the ParallelRunner overrides of those functions, check the rank before calling super() (sketched below).
  • Passing of the model_comm_group can be done in the same way as we did for the CRPS runner that's now in develop (see my other comment about predict_step).
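
Something like this (just a sketch, names illustrative, not the final API):

import logging

LOG = logging.getLogger(__name__)

class Runner:
    global_rank = 0  # single-process default

    def log(self, msg, *args):
        LOG.info(msg, *args)

    def output(self, state):
        LOG.info("writing state %s", state)  # stand-in for the real output block

class ParallelRunner(Runner):
    def __init__(self, global_rank):
        self.global_rank = global_rank

    def log(self, msg, *args):
        # Only rank 0 emits info-level logs, so parallel runs don't spam.
        if self.global_rank == 0:
            super().log(msg, *args)

    def output(self, state):
        # Only rank 0 writes output, avoiding duplicate files.
        if self.global_rank == 0:
            super().output(state)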

@jswijnands

Thanks Cathal, this looks very interesting. I had a question about the Slurm dependency to launch the parallel processes. Do you mean it will only work using Slurm, or could this also work on AWS?

@cathalobrien (Contributor, Author) commented Jan 16, 2025

Thanks Cathal, this looks very interesting. I had a question about the Slurm dependency to launch the parallel processes. Do you mean it will only work using Slurm, or could this also work on AWS?

Hi @jswijnands. Yes, currently you would need Slurm to get the srun program.
For a single-node case, we could fall back to launching subprocesses via Python, something like the sketch below. For multi-node inference, we could use mpirun to launch the processes instead.
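
A rough sketch of the single-node fallback idea (run_inference here is a hypothetical placeholder entry point, not part of this PR):

import torch
import torch.multiprocessing as mp

def run_inference(rank, world_size):
    # Each spawned process pins itself to one GPU, then joins the
    # process group and runs its shard of the model as usual.
    torch.cuda.set_device(rank)
    ...

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size)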

Whether or not this works on AWS would depend on your exact setup. If you are using AWS ParallelCluster, you could get Slurm on that cluster.

It would be nice to get more details about your setup to make sure it is supported. Could you send me an email ([email protected]) or message me on Slack?

@ssmmnn11 (Member) left a comment


Nice work!

@cathalobrien (Contributor, Author) commented
All the parallel code has now been refactored into its own ParallelRunner class.

Parallelism is no longer automatic: you must add "runner: parallel" to the inference config file (and launch with srun); see the sketch below.

Duplicated logging is mostly gone now; logging on non-zero ranks is reduced to warnings and errors only.
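
For example, the config change is a single extra key (a minimal sketch; keep your usual checkpoint/input/output settings alongside it):

# inf.yaml
runner: parallel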

@gmertes (Member) left a comment


Great job on the parallel runner, LGTM!

Future runners that need both single and parallel functionality will cause some difficulties, but let's tackle that when it comes.

Do you want to add a small entry to the docs on parallel inference, with the example job script in there too?

It would be good to also have a standalone mode without Slurm, where the runner spawns its own subprocesses. That would be very useful for debugging and running in the cloud. But I would do that in a follow-up PR.

@HCookie (Member) left a comment


Looks really good.

gmertes previously approved these changes Jan 22, 2025

@gmertes (Member) left a comment


LGTM from my end, but others may still want to have a look.

@HCookie (Member) left a comment


Fantastic work on this PR.
Looks great to me.

@cathalobrien changed the title from "Parallel inference" to "feat: Added the option to run inference in parallel" on Jan 22, 2025
@cathalobrien merged commit e23934e into develop on Jan 22, 2025
78 checks passed
@cathalobrien deleted the feature/model-parallel branch on January 22, 2025 at 15:39