-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slurm on GKE - Guide #864
base: main
Are you sure you want to change the base?
Slurm on GKE - Guide #864
Conversation
@@ -0,0 +1,196 @@ | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a top-level generic module for infrastructure here https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/infrastructure
Is it possible to reuse that instead of creating another infrastructure module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately, the infra part is tied to the described example, I can do it but an a second release. wdyt?
Hello @andrewsykim I implemented the changes with 1 exception still open above, wdyt? Thanks! |
@andrewsykim as agreed, can you also add @mwysokin as reviewer? Thanks! |
This reverts commit 76703f0.
I'm unable to add him as a reviewer for some reason, but feel free to review anyways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM 🖖
I think the module could be improved by making it more flexible when it comes to image versions because most of them are currently hardcoded but it's a nit and a possible room for improvement if people are going to use it.
/gcbrun |
This guide shows you how to deploy Slurm on a Google Kubernetes Engine (GKE) cluster.
This guide is intended for platform administrators in an enterprise environment who are already managing Kubernetes or GKE clusters, and who need to set up Slurm clusters for AI/ML teams on Kubernetes. This guide is also for AI/ML startups that already use Kubernetes or GKE to run their workloads, such as inference workloads or web apps, and want to use their existing infrastructure to run training workloads with a Slurm interface.