Add a workload manager to GPU cluster #28

Open
PGijsbers opened this issue Jul 25, 2024 · 0 comments
Labels
automation CI/CD and other automation

Comments

PGijsbers commented Jul 25, 2024

Our GPU server is shared with the AutoML group but has no workload manager. As a result, resources are currently divided mostly over chat and/or through unwritten rules (we currently have 2 GPUs reserved by default). This is wasteful and also makes it hard to scale up experiments later on. We want a job scheduler installed so that everyone who needs to run GPU jobs can simply queue them, and we no longer need to manually ensure that people are not using the same physical resources.

Overall, the server is mainly intended for prototype testing, so the workload manager should give all users a quick turnaround time where reasonable. Allowing users to explicitly set a job priority for this is OK, as we only have a small number of users, who shouldn't abuse this.

I am not sure which workload manager is most appropriate, but I think everyone on our team is already familiar with SLURM.
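
To make the request concrete, here is a rough sketch of what queueing a GPU job could look like if SLURM ends up being the choice. It is an illustration only, not a decided setup: the helper name `build_sbatch_command`, its defaults, and the `python train.py` command are hypothetical, and it assumes the cluster would expose GPUs as a generic resource named `gpu`. It only builds the `sbatch` command line using the standard `--gres`, `--time`, `--nice`, and `--wrap` options and does not submit anything.

```python
import shlex
from typing import List


def build_sbatch_command(
    job_command: str,
    gpus: int = 1,
    nice: int = 0,
    time_limit: str = "01:00:00",
) -> List[str]:
    """Build (but do not run) an sbatch command line for a GPU job.

    Hypothetical helper: the name and defaults are assumptions made for
    illustration, as is the GPU generic resource being called "gpu".
    """
    if gpus < 1:
        raise ValueError("a GPU job must request at least one GPU")

    cmd = [
        "sbatch",
        f"--gres=gpu:{gpus}",    # request whole GPUs so jobs do not share a device
        f"--time={time_limit}",  # a wall-time limit helps keep turnaround short
    ]
    if nice:
        # A positive nice value lowers the job's scheduling priority, which
        # lets a user voluntarily yield to more urgent prototype runs.
        cmd.append(f"--nice={nice}")
    # --wrap lets sbatch submit a single command without a batch script file.
    cmd.append(f"--wrap={job_command}")
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of submitting it.
    print(shlex.join(build_sbatch_command("python train.py", gpus=1, nice=100)))
```

With something like this in place, nobody would need to coordinate over chat: the scheduler would hold the job until the requested GPUs are free, and a user running a long, non-urgent experiment could raise the `nice` value to let short prototype jobs go first.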

@PGijsbers PGijsbers added the automation CI/CD and other automation label Jul 25, 2024