Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slurm multinode example #3229

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ffrancesco94
Copy link

@ffrancesco94 ffrancesco94 commented Nov 8, 2024

What does this PR do?

The SLURM multinode submit script doesn't work for multiple reasons (see also #1239 which is still open). This PR aims at solving some of those issues, namely:

  • Typo in the $CMD command
  • $SLURM_NNODES was recently deprecated in favour of $SLURM_JOB_NUM_NODES
  • If multinode setup involves multiple GPUs, it has to be enforced with --multi-gpu and the rank of each process has to be set with --machine-rank
  • If multiple GPUs are present, the complete_nlp_example doesn't handle distributed evaluation correctly. In particular, either the evaluate GLUE mrpc metric has to be loaded with the --num_process and --process_id flags (which will subsequently fail due to several breakages due to recent datasets  evaluate#542 & related) or metric computation has to happen only on the main process. This PR goes for the second.

Fixes #3206

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. NA
  • Did you write any new necessary tests? NA

Who can review?

@muellerzr or @SunMarc

Compute metric with evaluate from main process only, avoiding bug in multinode evaluate.
Enforce --multi_gpu on multiple nodes. Moreover, make sure that each rank gets correctly addressed based on the $SLURM_PROCID. $SLURM_NNODES has now been deprecated and replaced by $SLURM_JOB_NUM_NODES. Fixed typo in $CMD as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multinode, multigpu example fails
1 participant