Slurm on GKE - Guide #864

Open
wants to merge 13 commits into
base: main

danielmarzini

This guide shows you how to deploy Slurm on a Google Kubernetes Engine (GKE) cluster.

This guide is intended for platform administrators in an enterprise environment who are already managing Kubernetes or GKE clusters, and who need to set up Slurm clusters for AI/ML teams on Kubernetes. This guide is also for AI/ML startups that already use Kubernetes or GKE to run their workloads, such as inference workloads or web apps, and want to use their existing infrastructure to run training workloads with a Slurm interface.

slurm-on-gke/image/Dockerfile (resolved)
slurm-on-gke/.gitignore (outdated, resolved)
slurm-on-gke/.gitignore (outdated, resolved)
slurm-on-gke/README.md (outdated, resolved)
@@ -0,0 +1,196 @@
/**
Collaborator

We have a top-level generic module for infrastructure here https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/infrastructure

Is it possible to reuse that instead of creating another infrastructure module?

Author

Unfortunately, the infra part is tied to the described example. I can do it, but in a second release. wdyt?

slurm-on-gke/modules/slurm-cluster/main.tf (outdated, resolved)
@danielmarzini
Author

danielmarzini commented Nov 4, 2024

Hello @andrewsykim, I implemented the changes, with one exception still open above. wdyt? Thanks!

@danielmarzini
Author

@andrewsykim as agreed, can you also add @mwysokin as reviewer? Thanks!

@andrewsykim
Collaborator

> @andrewsykim as agreed, can you also add @mwysokin as reviewer? Thanks!

I'm unable to add him as a reviewer for some reason, but feel free to review anyways

@mwysokin mwysokin left a comment

Overall LGTM 🖖

I think the module could be improved by making it more flexible when it comes to image versions, because most of them are currently hardcoded. It's a nit, though, and possible room for improvement if people are going to use it.
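
As a rough illustration of the image-version point, here is a minimal Terraform sketch. The variable names, the default tag, and the repository path are all hypothetical and are not taken from this PR; the idea is only that the image reference is assembled from inputs instead of being written inline in the module.

```hcl
# Hypothetical input variables: names and defaults are illustrative only,
# not the ones used in slurm-on-gke/modules/slurm-cluster.
variable "slurm_image_repository" {
  description = "Container image repository for the Slurm pods."
  type        = string
}

variable "slurm_image_tag" {
  description = "Image tag to deploy; override to pin or upgrade a version."
  type        = string
  default     = "latest" # placeholder default, an assumption
}

locals {
  # Compose the full image reference once and reuse it wherever the
  # module currently hardcodes an image string.
  slurm_image = "${var.slurm_image_repository}:${var.slurm_image_tag}"
}
```

A caller could then set `slurm_image_tag` per environment without editing the module itself.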

@andrewsykim
Collaborator

/gcbrun


3 participants