[GSoC 2024] Scheduling AI workload among multiple clusters #369

Open
haoqing0110 opened this issue Mar 8, 2024 · 17 comments
@haoqing0110
Member

haoqing0110 commented Mar 8, 2024

This is one of the GSoC 2024 projects.

Announcement
cncf/mentoring#1221

Google Summer of Code 2024 Timeline
https://developers.google.com/open-source/gsoc/timeline

Description

Open Cluster Management (OCM) focuses on multicluster and multicloud management scenarios for Kubernetes applications. Open APIs are evolving within this project for cluster registration, workload distribution, dynamic placement of policies and workloads, and much more. The placement concept is used to dynamically select a set of clusters so that higher level users can either replicate Kubernetes resources to the member clusters or run their advanced workload. For example: as an application developer, I can deploy my workload to clusters with the most allocatable memory and CPU.

Now, with the rise of AI technology, there is a growing need to schedule AI workloads based on GPU/TPU resources. In this project we want you to use the placement extensible scheduling mechanism to implement a GPU/TPU resource collector addon, built with the addon template, that provides an AddonPlacementScore so that placement decisions can be made based on GPU/TPU resources. We also want you to propose a customized external Kueue Admission Check controller that consumes the placement decision to schedule AI workloads among multiple clusters based on GPU/TPU resources.

Expected Outcome

  • Develop the GPU/TPU resource collector addon. This includes documenting the addon architecture and describing how the AddonPlacementScore is used, implementing the addon with the addon template, and contributing the code to the addon-contrib repository.

  • Deliver a proposal for the external Kueue Admission Check controller. The proposal should outline the API design and explain how the controller uses the OCM scheduling result and interacts with Kueue. The proposal must be reviewed in an OCM community meeting. You also need to deliver a prototype based on the proposal.

Recommended Skills

Golang, Kubernetes, Scheduling

Mentor(s)

Qing Hao (@haoqing0110, [email protected]) - primary
Jian Qiu (@qiujian16, [email protected])

References
Open Cluster Management
Placement concept
AddOn concept
Placement extensible scheduling mechanism
Build an addon with addon template
GPU on *KS, for example GPUs in GKE
Kueue Admission Check

Discussion
Feel free to raise your questions here. You can also reach out to us in the Slack channel. Failed to join by the link? See solutions at #369 (comment).

@Sayanjones

Hi @haoqing0110, I am interested in working on this project. Can we discuss this further?

@Sayanjones

I have gone through the project. As I understand it, it requires an addon that collects GPU/TPU resource data and scores clusters based on it (to be contributed to addon-contrib), plus a proposal for an external Kueue Admission Check controller that uses OCM's placement decisions for scheduling (community review needed).

@z1ens
Contributor

z1ens commented Mar 11, 2024

Hello @haoqing0110 :)
I am really interested in this GSoC project and looking forward to contributing useful code to OCM.

About me: My name is Zhe Shen, and I am a third-year undergraduate student of computer science in Germany. I am familiar with Go and Kubernetes. I recently completed a project that built a FaaS platform from scratch, integrated with a Kubernetes environment, which can deploy, manage, and scale functions easily.

I went through the official OCM website and tried some of its features: installing OCM, deploying Kubernetes resources to a specific cluster with a ManifestWork, and creating a Placement to manage a set of clusters (distributing the deployments across both clusters). All of them worked successfully.

After researching addon templates, I have a few questions:


  1. How will the AddonPlacementScore algorithm evaluate clusters based on their GPU/TPU resources? What factors will it consider (utilization rates, custom metrics)?
  2. How will the AddonPlacementScore integrate with existing OCM scheduling mechanisms?
  3. How can we ensure the addon and controller are compatible with different Kubernetes distributions and versions?

All in all, I am aware that this project is more challenging than building a FaaS, and I am ready to learn and work on it! Thank you for taking the time to read this. I look forward to your reply.
P.S. I noticed that the OCM website supports documentation in Chinese. I can help maintain it as well, since Chinese is my native language.

@haoqing0110
Member Author

Hello @Sayanjones @z1ens, thanks for your interest in this project. Feel free to join our community Slack channel if you want to discuss further.


@z1ens Thank you for your questions. Below are some of my thoughts:

  1. The most basic factors are the allocatable resources and the current usage. Metrics are a good idea; it would be worth investigating whether they are feasible.
  2. Internally, scheduling is logically divided into two phases: Predicate and Prioritize. Using AddonPlacementScore to select clusters is one part of that process. I hope the placement concept page makes this clear. At the code level, you can refer to https://github.com/open-cluster-management-io/ocm/blob/main/pkg/placement/plugins/addon/addon.go to see how it works.
  3. In most cases I think a k8s upgrade should preserve backward compatibility, but we also need to pay attention to any breaking changes.
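To make the two phases in point 2 concrete, here is a minimal, self-contained Go sketch. It is not the OCM placement code: the `Cluster` type, the label-matching rule, and the `Score` field are hypothetical simplifications, with `Score` standing in for a value that a real setup would read from an AddOnPlacementScore.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a hypothetical, simplified view of a managed cluster.
// Score stands in for a value an addon would publish for this cluster.
type Cluster struct {
	Name   string
	Labels map[string]string
	Score  int
}

// predicate is the filtering phase: it keeps only the clusters whose
// labels contain every required key/value pair.
func predicate(clusters []Cluster, required map[string]string) []Cluster {
	var kept []Cluster
	for _, c := range clusters {
		matches := true
		for k, v := range required {
			if c.Labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			kept = append(kept, c)
		}
	}
	return kept
}

// prioritize is the ranking phase: it sorts the remaining clusters by
// score (highest first, name as tie-breaker) and returns the top n.
func prioritize(clusters []Cluster, n int) []Cluster {
	sorted := append([]Cluster(nil), clusters...) // do not mutate the input
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Score != sorted[j].Score {
			return sorted[i].Score > sorted[j].Score
		}
		return sorted[i].Name < sorted[j].Name
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	clusters := []Cluster{
		{Name: "cluster1", Labels: map[string]string{"gpu": "true"}, Score: 20},
		{Name: "cluster2", Labels: map[string]string{"gpu": "true"}, Score: 80},
		{Name: "cluster3", Labels: map[string]string{"gpu": "false"}, Score: 95},
	}
	candidates := predicate(clusters, map[string]string{"gpu": "true"})
	for _, c := range prioritize(candidates, 1) {
		fmt.Println(c.Name, c.Score)
	}
}
```

In this sketch cluster3 has the highest score but is dropped in the Predicate phase, so the Prioritize phase only ranks cluster1 and cluster2.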

@haoqing0110
Member Author

cc @qiujian16

@k2nt

k2nt commented Mar 11, 2024

Hi @haoqing0110, my name is Khai. I came across this project in GSoC 2024, and I would love to be a contributor. I tried to join the Slack workspace but ran into the error "It looks like there isn’t an account on Kubernetes tied to this email address." I look forward to discussing more with you!

@mikeshng
Member

https://communityinviter.com/apps/kubernetes/community

@k2nt you can get an invite here for the Slack channel.

@k2nt

k2nt commented Mar 11, 2024

Hi @mikeshng. Thank you for your email (and post)! I hope you can point me to the correct channel for this project (I assume that it is open-cluster-mgmt). I am posting here instead of replying via email so that other contributors can see this also.

@mikeshng
Member

Thanks @k2nt yes, the channel is #open-cluster-mgmt

@z1ens
Contributor

z1ens commented Mar 11, 2024

Hello, @haoqing0110
Thank you for patiently answering my questions. Your ideas sound inspiring, and I will take a look at the code. I have also just joined the Slack channel. Have a nice day!

@mikeshng
Member

Hi all, @haoqing0110 is going to talk more about this topic in this week's community meeting.

Please feel free to ask any questions here or during the meeting.

You can find the community meeting schedule here:
https://calendar.google.com/calendar/u/0/[email protected]

@haoqing0110
Member Author

This has been selected to participate in this year's Google Summer of Code! 🎉 cncf/mentoring#1221

@haoqing0110
Member Author

/assign @z1ens

@ivan-cai
Contributor

@qiujian16 @haoqing0110 The resource-usage-collect agent needs to consider the available resources of each node. Sometimes the cluster resources are sufficient in total, but the resources on any single node are insufficient.

@haoqing0110
Member Author

@ivan-cai Yes, I believe @z1ens's PR open-cluster-management-io/addon-contrib#20 has been changed to calculate the score based on the max node resource. We also discussed whether we need both a cluster resource score and a node resource score; it seems the node resource score is more useful.

@z1ens
Contributor

z1ens commented Aug 16, 2024

@ivan-cai Exactly as @haoqing0110 mentioned, I’ve implemented a scoring strategy in the resource-usage-collect-addon that includes both node scope and cluster scope scores. In Kubernetes, a job can only be scheduled if a single node in the cluster has resources >= the job's request. Therefore, linking the scoring mechanism to the node with the maximum available resources is logical. I also developed a cluster scope score that assesses the total available resources in the cluster, as sometimes cluster admins want to spread workloads across multiple clusters or nodes to enhance resource utilization.
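The difference between the two scopes can be illustrated with a short Go sketch. This is not the code from the resource-usage-collect-addon: the `Node` type, the helper names, the upper bounds, and the linear mapping are all assumptions made for illustration. The only external fact relied on is that an AddOnPlacementScore value is expected to fall in the range -100 to 100.

```go
package main

import "fmt"

// Node is a hypothetical view of the GPU capacity of a single node.
type Node struct {
	Name           string
	AllocatableGPU int64
	UsedGPU        int64
}

// available returns the free GPUs on a node, never less than zero.
func available(n Node) int64 {
	free := n.AllocatableGPU - n.UsedGPU
	if free < 0 {
		return 0
	}
	return free
}

// maxNodeAvailable is the node-scope input: the largest number of free
// GPUs on any single node, which bounds the biggest request that fits.
func maxNodeAvailable(nodes []Node) int64 {
	var best int64
	for _, n := range nodes {
		if a := available(n); a > best {
			best = a
		}
	}
	return best
}

// totalAvailable is the cluster-scope input: free GPUs summed over all nodes.
func totalAvailable(nodes []Node) int64 {
	var sum int64
	for _, n := range nodes {
		sum += available(n)
	}
	return sum
}

// normalize maps a value in [0, upper] linearly onto [-100, 100].
// The choice of upper bound and the linear mapping are assumptions.
func normalize(value, upper int64) int32 {
	if upper <= 0 {
		return -100
	}
	if value < 0 {
		value = 0
	}
	if value > upper {
		value = upper
	}
	return int32(value*200/upper - 100)
}

func main() {
	nodes := []Node{
		{Name: "node-a", AllocatableGPU: 8, UsedGPU: 6},
		{Name: "node-b", AllocatableGPU: 4, UsedGPU: 1},
		{Name: "node-c", AllocatableGPU: 4, UsedGPU: 4},
	}
	fmt.Println("max free GPUs on one node:", maxNodeAvailable(nodes))
	fmt.Println("total free GPUs in cluster:", totalAvailable(nodes))
	fmt.Println("node-scope score:", normalize(maxNodeAvailable(nodes), 8))
	fmt.Println("cluster-scope score:", normalize(totalAvailable(nodes), 16))
}
```

With these sample nodes the cluster has 5 free GPUs in total, yet no single node has more than 3, so a job requesting 4 GPUs would not fit anywhere. That is the case the node-scope score is meant to capture.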

@haoqing0110
Member Author

Congratulations to @z1ens for completing the Google Summer of Code 2024 and contributing to the Open Cluster Management community.

The following PRs have been merged to our repos:
GPU/TPU-resource-usage-collect-addon
OCM Kueue Admission Check Controller

These contributions are also an important part of two KubeCon talks:
Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience
Boundaryless Computing: Optimizing LLM Performance, Cost and Efficiency in Multi-Cloud Architecture

Thanks again for your contributions!

Projects
Status: Done

6 participants