
feat: Added the option to run inference in parallel #108

Merged 30 commits into develop from feature/model-parallel on Jan 22, 2025

Conversation

@cathalobrien (Contributor) commented Jan 16, 2025

This PR allows you to run inference across multiple GPUs and nodes.

Parallel inference relies on PR #77 to anemoi-models. Running sequentially with an older version of anemoi-models will still work; trying to run in parallel with an older version will prompt you to upgrade. Once a release of anemoi-models includes PR #77, it might be worthwhile bumping the minimum version required by anemoi-inference.

Currently Slurm is required to launch the parallel processes, and some Slurm environment variables are read to set up the networking. Below is an example Slurm batch script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=0:05:00
#SBATCH --output=outputs/parallel_inf.%j.out

source env.sh
srun anemoi-inference run inf.yaml
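
Roughly, the setup reads those Slurm variables and initialises a torch.distributed process group. This is only a minimal sketch of the idea, not the exact code in this PR; the port and backend choice here are illustrative:

import datetime
import os
import subprocess

import torch.distributed as dist

# Rank and world size come from Slurm's per-task environment.
global_rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# Use the first node of the job as the rendezvous host.
# "scontrol show hostnames" expands the compact node list, e.g. "node[01-02]".
master_addr = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.splitlines()[0]

dist.init_process_group(
    backend="nccl",  # GPU inference; "gloo" would be the CPU choice
    init_method=f"tcp://{master_addr}:10000",  # illustrative port
    world_size=world_size,
    rank=global_rank,
    timeout=datetime.timedelta(minutes=3),
)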

One quality-of-life feature would be to make only process 0 log. Currently you get a lot of spam when running in parallel because every process logs. Any ideas on how to do this nicely would be welcome:

2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
2025-01-16 08:22:53 INFO Accumulating fields ['cp', 'tp']
...
2025-01-16 08:23:26 INFO World size: 4

📚 Documentation preview 📚: https://anemoi-inference--108.org.readthedocs.build/en/108/

@codecov-commenter commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.03%. Comparing base (e9dbd48) to head (ecc8fa0).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #108   +/-   ##
========================================
  Coverage    98.03%   98.03%           
========================================
  Files            3        3           
  Lines           51       51           
========================================
  Hits            50       50           
  Misses           1        1           

☔ View full report in Codecov by Sentry.

@gmertes (Member) left a comment


I'm a bit concerned about adding the parallel stuff to the default Runner. I would rather create a separate ParallelRunner for it.

In develop we now have the option to instantiate runners from the config, so you would explicitly select it when you want to do parallel inference.

Opinions on factoring out this code into a new runner?

  • Add a Runner.log() that wraps the logger.
  • Factor out the output block into a Runner.output().
  • Then, in the ParallelRunner overrides of those functions, check the rank before calling super() (sketched below).
  • Passing of the model_comm_group can be done in the same way as we did for the CRPS runner that's now in develop (see my other comment about predict_step).
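
Something like this (just a sketch, names illustrative, not the final API):

import logging

LOG = logging.getLogger(__name__)

class Runner:
    global_rank = 0  # single-process default

    def log(self, msg, *args):
        LOG.info(msg, *args)

    def output(self, state):
        LOG.info("writing state %s", state)  # stand-in for the real output block

class ParallelRunner(Runner):
    def __init__(self, global_rank):
        self.global_rank = global_rank

    def log(self, msg, *args):
        # Only rank 0 emits info-level logs, so parallel runs don't spam.
        if self.global_rank == 0:
            super().log(msg, *args)

    def output(self, state):
        # Only rank 0 writes output, avoiding duplicate files.
        if self.global_rank == 0:
            super().output(state)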

@jswijnands

Thanks Cathal, this looks very interesting. I had a question about the Slurm dependency to launch the parallel processes. Do you mean it will only work using Slurm, or could this also work on AWS?

@cathalobrien (Contributor, Author) commented Jan 16, 2025

Thanks Cathal, this looks very interesting. I had a question about the Slurm dependency to launch the parallel processes. Do you mean it will only work using Slurm, or could this also work on AWS?

Hi @jswijnands. Yes, currently you would need Slurm to get the srun program.
For a single-node case, we could fall back to launching subprocesses via Python, something like the sketch below. For multi-node inference, we could use mpirun to launch the processes instead.
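
A rough sketch of the single-node fallback idea (run_inference here is a hypothetical placeholder entry point, not part of this PR):

import torch
import torch.multiprocessing as mp

def run_inference(rank, world_size):
    # Each spawned process pins itself to one GPU, then joins the
    # process group and runs its shard of the model as usual.
    torch.cuda.set_device(rank)
    ...

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size)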

Whether or not this works on AWS would depend on your exact setup. If you are using AWS ParallelCluster, you could get Slurm on that cluster.

It would be nice to get more details about your setup to make sure it is supported. Could you send me an email ([email protected]) or message me on Slack?

@ssmmnn11 (Member) left a comment


Nice work!

@cathalobrien (Contributor, Author) commented
All the parallel code has now been refactored into its own ParallelRunner class.

Parallelism is no longer automatic: you must add "runner: parallel" to the inference config file (and launch with srun); see the sketch below.

Duplicated logging is mostly gone now; logging on non-zero ranks is reduced to warnings and errors only.
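
For example, the config change is a single extra key (a minimal sketch; keep your usual checkpoint/input/output settings alongside it):

# inf.yaml
runner: parallel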

@gmertes (Member) left a comment


Great job on the parallel runner, LGTM!

Future runners that need both single and parallel functionality will cause some difficulties, but let's tackle that when it comes.

Do you want to add a small entry to the docs on parallel inference, with the example job script in there too?

It would be good to also have a standalone mode without Slurm, where the runner spawns its own subprocesses. That would be very useful for debugging and running in the cloud. But I would do that in a follow-up PR.

@HCookie (Member) left a comment


Looks really good.

gmertes previously approved these changes Jan 22, 2025

@gmertes (Member) left a comment


LGTM from my end, but others may still want to have a look.

@HCookie (Member) left a comment


Fantastic work on this PR.
Looks great to me.

@cathalobrien changed the title from "Parallel inference" to "feat: Added the option to run inference in parallel" on Jan 22, 2025
@cathalobrien merged commit e23934e into develop on Jan 22, 2025
78 checks passed
@cathalobrien deleted the feature/model-parallel branch on January 22, 2025 at 15:39