
feat: Added the option to run inference in parallel #108

Merged · 30 commits · Jan 22, 2025

Commits
ce74e6e
model parallel wip
cathalobrien Nov 25, 2024
936c60a
logging only on rank 0
cathalobrien Nov 26, 2024
d870289
fallback if env vars arent set and some work only done by rank 0
cathalobrien Nov 26, 2024
b39b796
changelog
cathalobrien Nov 26, 2024
b95e167
pre-commit checks and no model comm group for single gpu case
cathalobrien Nov 26, 2024
9fe691c
changelog
cathalobrien Nov 26, 2024
5f92574
added parallel inf
cathalobrien Jan 14, 2025
71fdf0e
precommit
cathalobrien Jan 14, 2025
9264754
9k parallel inference works
cathalobrien Jan 15, 2025
06a575d
refactor
cathalobrien Jan 15, 2025
fa89bb8
refactor
cathalobrien Jan 15, 2025
a6a4ea4
tidy
cathalobrien Jan 16, 2025
8a73f62
more compatible with older versions of models
cathalobrien Jan 16, 2025
db560eb
forgot precommit
cathalobrien Jan 16, 2025
b21d811
remove commented code
cathalobrien Jan 16, 2025
48ad37b
added license
cathalobrien Jan 16, 2025
b9ecc14
feedback
cathalobrien Jan 16, 2025
1a0ae49
Merge remote-tracking branch 'origin/develop' into feature/model-para…
cathalobrien Jan 17, 2025
27965ff
refactor to parallel runner
cathalobrien Jan 17, 2025
43167c5
refactored into explicit parallel runner class
cathalobrien Jan 17, 2025
6974ac3
allow MASTER_ADDR and MASTER_PORT to be set as env vars before runtime
cathalobrien Jan 20, 2025
2016c7b
readd line accicdentally deleted
cathalobrien Jan 21, 2025
bd391f5
added documentation
cathalobrien Jan 21, 2025
1cd4982
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 21, 2025
079036a
forgot precommit
cathalobrien Jan 21, 2025
d6a77ff
Merge branch 'feature/model-parallel' of github.com:ecmwf/anemoi-infe…
cathalobrien Jan 21, 2025
b8be926
docs feedback
cathalobrien Jan 21, 2025
5dd8a55
added a link to parallel inference to index
cathalobrien Jan 22, 2025
861161d
Ensure each model has the same seed
cathalobrien Jan 22, 2025
ecc8fa0
Merge branch 'develop' into feature/model-parallel
cathalobrien Jan 22, 2025
73 changes: 41 additions & 32 deletions docs/parallel.rst
@@ -1,54 +1,64 @@
####################
Parallel Inference
####################

If the memory requirements of your model are too large to fit within a
single GPU, you can run Anemoi-Inference in parallel across multiple
GPUs.

Parallel inference requires SLURM to launch the parallel processes and
to determine information about your network environment. If SLURM is
not available to you, please create an issue on the Anemoi-Inference
GitHub page.
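
Under SLURM, ``srun`` exports variables that tell each task its place
in the job. The sketch below is purely illustrative (an assumption
about the general approach, not anemoi-inference's actual code): it
shows how a parallel runner can build a ``torch.distributed`` process
group from those variables.

.. code:: python

   import os

   import torch
   import torch.distributed as dist

   # srun sets these variables for every task it launches.
   rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
   world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
   local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

   # Each task claims its own GPU on the node.
   torch.cuda.set_device(local_rank)

   # "env://" reads MASTER_ADDR and MASTER_PORT from the environment;
   # see the note on setting them manually below.
   dist.init_process_group(
       backend="nccl",
       init_method="env://",
       rank=rank,
       world_size=world_size,
   )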

***************
Configuration
***************

To run in parallel, you must add '`runner:parallel`' to your inference
config file.

.. code:: yaml

   checkpoint: /path/to/inference-last.ckpt
   lead_time: 60
   runner: parallel
   input:
      grib: /path/to/input.grib
   output:
      grib: /path/to/output.grib




*******************************
Running inference in parallel
*******************************

Below is an example SLURM batch script to launch a parallel inference
job across 4 GPUs.

.. code:: bash

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --gpus-per-node=4
   #SBATCH --cpus-per-task=8
   #SBATCH --time=0:05:00
   #SBATCH --output=outputs/parallel_inf.%j.out

   source /path/to/venv/bin/activate
   srun anemoi-inference run parallel.yaml

.. warning::

   If you specify '`runner:parallel`' but you don't launch with
   '`srun`', your anemoi-inference job may hang as only 1 process will
   be launched.
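
One way to turn that hang into an immediate, visible failure is a guard
in the batch script before the launch line. This is a hypothetical
safeguard, not something anemoi-inference provides; it assumes a
parallel launch should supply more than one SLURM task.

.. code:: bash

   # Abort early if there is no multi-task SLURM allocation, instead of
   # letting the parallel runner wait for peers that never start.
   if [ "${SLURM_NTASKS:-1}" -lt 2 ]; then
       echo "runner:parallel expects multiple srun tasks" >&2
       exit 1
   fi
   srun anemoi-inference run parallel.yaml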

.. note::

   By default, anemoi-inference will determine your system's master
   address and port itself. If this fails (e.g. when running
   Anemoi-Inference inside a container), you can instead set these
   values yourself via environment variables in your SLURM batch
   script:

.. code:: bash

   # Example approach (an assumption on our part, not the documented
   # command): use the first hostname in the job's node list.
   export MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
   export MASTER_PORT=$((10000 + RANDOM % 10000))

   srun anemoi-inference run parallel.yaml
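
The arithmetic expression above picks a random port in the 10000-19999
range, which reduces the chance of two jobs on the same node contending
for the same port.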
