This workflow run exposed an issue with our current workflow: both the JAX and the Pallas unit tests call the _runner_ondemand_slurm.yaml workflow to create A100 runners. If two such calls happen in quick succession, they create two runners with identical names (A100-${{ github_run_id }}) that the SLURM cluster may schedule at the same time, which prevents the actual job from landing properly on its runner (details still to be investigated).
To avoid conflicts between runners launched this way, each runner needs a distinct name, for example by including a UUID in the name.
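One possible way to derive such names is sketched below in Python. This is only an illustration, not the repository's actual implementation: the helper names `unique_runner_name` and `log_file_for`, the 8-character suffix length, and the `.log` naming scheme are all assumptions made for the example.

```python
import uuid


def unique_runner_name(run_id: str, prefix: str = "A100") -> str:
    """Build a runner name that differs on every call (hypothetical helper).

    The workflow run ID is kept for traceability; a random UUID fragment
    distinguishes runners requested by the same workflow run.
    """
    suffix = uuid.uuid4().hex[:8]  # assumed length; any unique token would do
    return f"{prefix}-{run_id}-{suffix}"


def log_file_for(runner_name: str) -> str:
    """Derive a per-runner log file name so concurrent runners do not share one log."""
    return f"{runner_name}.log"


if __name__ == "__main__":
    # Two requests from the same (made-up) workflow run get different names.
    jax_runner = unique_runner_name("1234567890")
    pallas_runner = unique_runner_name("1234567890")
    print(jax_runner, log_file_for(jax_runner))
    print(pallas_runner, log_file_for(pallas_runner))
```

Keeping the run ID in the name preserves the link back to the originating workflow run, while the random suffix removes the collision between the JAX and Pallas requests.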
yhtang changed the title from "Unique name for SLURM-launched A100 runners" to "Unique name and logfile for SLURM-launched A100 runners" on Feb 14, 2024