satyaog changed the title from "Define what is the target number of jobs that can be run on the Mila cluster" to "Define what is the target number of jobs that can be run on the different clusters" on Aug 17, 2021
For CC we might refer to their documentation for details.
Should we rephrase it as "...for larger experiments that don't fit in Mila cluster..."?
For Mila, I guess this recommendation was relevant before AO2, when the number of GPUs was around 170. A quick search shows that this number of 5 was introduced in a pre-pandemic world, in April 2019 (https://github.com/mila-iqia/AI-HPC-Docs/commit/667d8097913821112fecd2aab24ed4154f069743). We have 500+ GPUs in the cluster now. Moreover, it is the job of the batch scheduler to distribute jobs based on defined rules and QoS; users should not have to limit themselves.
The "Data parallelism" section implies that hundreds of jobs can be run. This contradicts what is explained above.
Originally posted by @tesfaldet in #46 (comment)
How do you define "larger experiments"? How many jobs?
Originally posted by @tesfaldet in #46 (comment)