Unique name and logfile for SLURM-launched A100 runners #525

Open
yhtang opened this issue Feb 8, 2024 · 0 comments

Comments

Collaborator

yhtang commented Feb 8, 2024

This workflow run exposed an issue with our current workflow: both the JAX and Pallas unit tests call the _runner_ondemand_slurm.yaml workflow to create A100 runners. If two such calls happen in quick succession, they end up creating two runners with identical names (A100-${{ github_run_id }}) that the SLURM cluster may schedule at the same time, which prevents the actual job from landing properly on a runner (more details to be discovered here).

To avoid conflicts between runners launched this way, each runner needs a different name, e.g. one that includes a UUID.
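
A minimal sketch of the naming idea, not the project's actual implementation: it appends a short random UUID fragment to the existing `A100-<run id>` pattern and derives a per-runner logfile name from the result. The helper names `make_runner_name` and `make_logfile_name`, the 8-character suffix length, and the `.log` extension are all assumptions made for illustration.

```python
import uuid


def make_runner_name(run_id: str, prefix: str = "A100") -> str:
    """Return a runner name that stays unique across calls with the same run id.

    Hypothetical helper: the suffix is the first 8 hex characters of a
    random UUID, so two calls from the same workflow run get different names.
    """
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{run_id}-{suffix}"


def make_logfile_name(runner_name: str) -> str:
    """Derive a logfile name from the runner name (assumed '.log' extension)."""
    return f"{runner_name}.log"


if __name__ == "__main__":
    # The run id below is a made-up placeholder, not taken from a real run.
    name = make_runner_name("1234567890")
    print(name)
    print(make_logfile_name(name))
```

Because the logfile name is derived from the runner name, two runners launched from the same workflow run would also write to separate logs.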

yhtang changed the title from "Unique name for SLURM-launched A100 runners" to "Unique name and logfile for SLURM-launched A100 runners" on Feb 14, 2024