document MLBatch for RHOAI 2.15 (#108)
1 parent 173f8b2, commit 5e853a7. Showing 14 changed files with 826 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,148 @@ | ||
# Cluster Setup | ||
|
||
The cluster setup installs Red Hat OpenShift AI and Coscheduler, configures Kueue, | ||
cluster roles, and priority classes. | ||
|
||
If MLBatch is deployed on a cluster that used to run earlier versions of ODH, | ||
[MCAD](https://github.com/project-codeflare/mcad), Red Hat OpenShift AI, or Coscheduler, | ||
make sure to scrub traces of these installations. In particular, delete the | ||
following custom resource definitions (CRDs) if present on the cluster, and | ||
delete all of their instances before deleting the CRDs themselves: | ||
```sh | ||
# Delete old appwrappers and crd | ||
oc delete appwrappers --all -A | ||
oc delete crd appwrappers.workload.codeflare.dev | ||
|
||
# Delete old noderesourcetopologies and crd | ||
oc delete noderesourcetopologies --all -A | ||
oc delete crd noderesourcetopologies.topology.node.k8s.io | ||
``` | ||
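The cleanup above can be guarded so that it only runs when a CRD is actually present. The following is a minimal sketch, not part of the MLBatch setup scripts; the function name `delete_crd_if_present` is hypothetical, and it uses only the `oc get crd` and `oc delete` commands shown above.

```sh
# Hypothetical helper (not part of MLBatch): delete all instances of a
# custom resource, then its CRD, but only if the CRD exists on the cluster.
delete_crd_if_present() {
  resource="$1"   # resource name, e.g. appwrappers
  crd="$2"        # full CRD name, e.g. appwrappers.workload.codeflare.dev
  if oc get crd "$crd" >/dev/null 2>&1; then
    # Instances must go first, otherwise they are orphaned with the CRD.
    oc delete "$resource" --all -A
    oc delete crd "$crd"
  else
    echo "CRD $crd not found, nothing to delete"
  fi
}
```

For example, `delete_crd_if_present appwrappers appwrappers.workload.codeflare.dev` covers the first pair of commands above, and the same call with `noderesourcetopologies` and `noderesourcetopologies.topology.node.k8s.io` covers the second.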
|
||
## Priorities | ||
|
||
Create `default-priority`, `high-priority`, and `low-priority` priority classes: | ||
```sh | ||
oc apply -f setup.RHOAI-v2.15/mlbatch-priorities.yaml | ||
``` | ||
|
||
## Coscheduler | ||
|
||
Install Coscheduler v0.28.9 as a secondary scheduler and configure packing: | ||
```sh | ||
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \ | ||
scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \ | ||
--set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]' | ||
``` | ||
Patch Coscheduler pod priorities: | ||
```sh | ||
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.15/coscheduler-priority-patch.yaml scheduler-plugins-controller | ||
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.15/coscheduler-priority-patch.yaml scheduler-plugins-scheduler | ||
``` | ||
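After patching, it may be worth confirming that both deployments picked up the new priority class. The sketch below assumes the patch sets `priorityClassName` to `system-node-critical`, as in the three-line JSON patch included in this commit; the function name `check_coscheduler_priority` is hypothetical.

```sh
# Hypothetical check (not part of MLBatch): confirm that both Coscheduler
# deployments carry the priority class assumed to be set by the patch file.
check_coscheduler_priority() {
  status=0
  for deploy in scheduler-plugins-controller scheduler-plugins-scheduler; do
    actual=$(oc get deployment -n scheduler-plugins "$deploy" \
      -o jsonpath='{.spec.template.spec.priorityClassName}')
    if [ "$actual" = "system-node-critical" ]; then
      echo "$deploy: ok"
    else
      echo "$deploy: expected system-node-critical, got '$actual'"
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_coscheduler_priority` prints one line per deployment and returns a non-zero status if either deployment is missing the expected priority class.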
|
||
## Red Hat OpenShift AI | ||
|
||
Create the Red Hat OpenShift AI subscription: | ||
```sh | ||
oc apply -f setup.RHOAI-v2.15/mlbatch-subscription.yaml | ||
``` | ||
Identify install plan: | ||
```sh | ||
oc get ip -n redhat-ods-operator | ||
``` | ||
``` | ||
NAMESPACE NAME CSV APPROVAL APPROVED | ||
redhat-ods-operator install-kmh8w rhods-operator.2.10.0 Manual false | ||
``` | ||
Approve the install plan, replacing the generated plan name below with the actual | ||
value: | ||
```sh | ||
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w | ||
``` | ||
Create DSC Initialization: | ||
```sh | ||
oc apply -f setup.RHOAI-v2.15/mlbatch-dsci.yaml | ||
``` | ||
Create Data Science Cluster: | ||
```sh | ||
oc apply -f setup.RHOAI-v2.15/mlbatch-dsc.yaml | ||
``` | ||
The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift | ||
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The | ||
remaining components such as `dashboard` can be optionally enabled. | ||
|
||
The configuration of the managed components differs from the default Red Hat OpenShift | ||
AI configuration as follows: | ||
- Kubeflow Training Operator: | ||
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`, | ||
- Kueue: | ||
  - `manageJobsWithoutQueueName` is enabled, | ||
  - `batch/job` integration is disabled, | ||
  - `waitForPodsReady` is disabled, | ||
  - `LendingLimit` feature gate is enabled, | ||
  - `enableClusterQueueResources` metrics are enabled, | ||
- Codeflare operator: | ||
  - the AppWrapper controller is enabled and configured as follows: | ||
    - `userRBACAdmissionCheck` is disabled, | ||
    - `schedulerName` is set to `scheduler-plugins-scheduler`, | ||
    - `queueName` is set to `default-queue`, | ||
    - `slackQueueName` is set to `slack-cluster-queue` | ||
- pod priorities, resource requests and limits have been adjusted. | ||
|
||
|
||
|
||
## Kueue Configuration | ||
|
||
Create Kueue's default flavor: | ||
```sh | ||
oc apply -f setup.RHOAI-v2.15/default-flavor.yaml | ||
``` | ||
|
||
## Cluster Role | ||
|
||
Create `mlbatch-edit` role: | ||
```sh | ||
oc apply -f setup.RHOAI-v2.15/mlbatch-edit-role.yaml | ||
``` | ||
|
||
## Slack Cluster Queue | ||
|
||
Create the designated slack `ClusterQueue` which will be used to automate | ||
minor adjustments to cluster capacity caused by node failures and | ||
scheduler maintenance. | ||
```sh | ||
oc apply -f- << EOF | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: ClusterQueue | ||
metadata: | ||
  name: slack-cluster-queue | ||
spec: | ||
  namespaceSelector: {} | ||
  cohort: default-cohort | ||
  preemption: | ||
    withinClusterQueue: LowerOrNewerEqualPriority | ||
    reclaimWithinCohort: Any | ||
    borrowWithinCohort: | ||
      policy: Never | ||
  resourceGroups: | ||
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"] | ||
    flavors: | ||
    - name: default-flavor | ||
      resources: | ||
      - name: "cpu" | ||
        nominalQuota: 8000m | ||
      - name: "memory" | ||
        nominalQuota: 128Gi | ||
      - name: "nvidia.com/gpu" | ||
        nominalQuota: 8 | ||
      - name: "nvidia.com/roce_gdr" | ||
        nominalQuota: 1 | ||
      - name: "pods" | ||
        nominalQuota: 100 | ||
EOF | ||
``` | ||
Edit the above quantities to adjust the quota to the desired | ||
values. Pod counts are optional and can be omitted from the list of | ||
covered resources. The `lendingLimit` for each resource will be | ||
dynamically adjusted by the MLBatch system to reflect reduced cluster | ||
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a | ||
detailed discussion of the role of the slack `ClusterQueue`. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
# Team Setup | ||
|
||
A *team* in MLBatch is a group of users that share a resource quota. | ||
|
||
Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) | ||
for a discussion of our recommended best practices. | ||
|
||
|
||
Setting up a new team requires the cluster admin to create a project, | ||
a user group, a quota, a queue, and the required role bindings as described below. | ||
|
||
Create project: | ||
```sh | ||
oc new-project team1 | ||
``` | ||
Create user group: | ||
```sh | ||
oc adm groups new team1-edit-group | ||
``` | ||
Add users to the group, for example: | ||
```sh | ||
oc adm groups add-users team1-edit-group user1 | ||
``` | ||
Bind cluster role to group in namespace: | ||
```sh | ||
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1 | ||
``` | ||
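The four steps above can be wrapped into one function when several teams need to be created. This is a minimal sketch under the naming convention used in this document (`<team>-edit-group`); the function name `setup_team` is hypothetical and not part of MLBatch.

```sh
# Hypothetical helper (not part of MLBatch): create the project, the user
# group, the group membership, and the role binding for one team.
setup_team() {
  team="$1"
  shift                       # remaining arguments are user names
  group="${team}-edit-group"
  oc new-project "$team" || return 1
  oc adm groups new "$group" || return 1
  if [ "$#" -gt 0 ]; then
    # add-users accepts one or more user names after the group name.
    oc adm groups add-users "$group" "$@" || return 1
  fi
  oc adm policy add-role-to-group mlbatch-edit "$group" \
    --role-namespace="" --namespace "$team"
}
```

For example, `setup_team team1 user1` reproduces the commands shown above. Note that `oc new-project` also switches the current project to the new one.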
|
||
Specify the intended quota for the namespace by creating a `ClusterQueue`: | ||
```sh | ||
oc apply -f- << EOF | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: ClusterQueue | ||
metadata: | ||
  name: team1-cluster-queue | ||
spec: | ||
  namespaceSelector: {} | ||
  cohort: default-cohort | ||
  preemption: | ||
    withinClusterQueue: LowerOrNewerEqualPriority | ||
    reclaimWithinCohort: Any | ||
    borrowWithinCohort: | ||
      policy: Never | ||
  resourceGroups: | ||
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"] | ||
    flavors: | ||
    - name: default-flavor | ||
      resources: | ||
      - name: "cpu" | ||
        nominalQuota: 8000m | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "memory" | ||
        nominalQuota: 128Gi | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "nvidia.com/gpu" | ||
        nominalQuota: 16 | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "nvidia.com/roce_gdr" | ||
        nominalQuota: 4 | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "pods" | ||
        nominalQuota: 100 | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
EOF | ||
``` | ||
Edit the above quantities to adjust the quota to the desired values. Pod counts | ||
are optional and can be omitted from the list of covered resources. | ||
|
||
Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing | ||
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other | ||
namespaces from borrowing quota from this namespace. | ||
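If the `ClusterQueue` manifest is kept in a file, the commented limit lines can be enabled with a small `sed` filter instead of by hand. This is a sketch only; the function name `uncomment_limit` and the file name in the usage note are hypothetical.

```sh
# Hypothetical filter (not part of MLBatch): read a manifest on stdin and
# uncomment every "# <field>: ..." line for the given field, keeping the
# original indentation.
uncomment_limit() {
  field="$1"   # borrowingLimit or lendingLimit
  sed -E "s/^([[:space:]]*)# (${field}:)/\1\2/"
}
```

For example, assuming the manifest was saved as `team1-cluster-queue.yaml`, `uncomment_limit borrowingLimit < team1-cluster-queue.yaml | oc apply -f-` applies the queue with borrowing disabled while leaving the `lendingLimit` lines commented out.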
|
||
Create a `LocalQueue` to bind the `ClusterQueue` to the namespace: | ||
```sh | ||
oc apply -n team1 -f- << EOF | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: LocalQueue | ||
metadata: | ||
  name: default-queue | ||
spec: | ||
  clusterQueue: team1-cluster-queue | ||
EOF | ||
``` | ||
We recommend naming the local queue `default-queue` as `AppWrappers` will | ||
default to this queue name. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Uninstall | ||
|
||
***First, remove all team projects and corresponding cluster queues.*** | ||
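Removing a team can be sketched as follows, assuming the team was created with the names used in the team setup instructions (project `<team>` and cluster queue `<team>-cluster-queue`). The function name `remove_team` is hypothetical and not part of MLBatch.

```sh
# Hypothetical helper (not part of MLBatch): delete a team's project and
# its cluster queue, assuming the <team>-cluster-queue naming convention.
remove_team() {
  team="$1"
  # Deleting the project also removes the LocalQueue inside it.
  oc delete project "$team" || return 1
  oc delete clusterqueue "${team}-cluster-queue"
}
```

For example, `remove_team team1` removes the project and queue created in the team setup example.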
|
||
Then to uninstall the MLBatch controllers and reclaim the corresponding | ||
namespaces, run: | ||
```sh | ||
# OpenShift AI uninstall | ||
oc delete dsc mlbatch-dsc | ||
oc delete dsci mlbatch-dsci | ||
oc delete subscription -n redhat-ods-operator rhods-operator | ||
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator | ||
oc delete crd featuretrackers.features.opendatahub.io \ | ||
dscinitializations.dscinitialization.opendatahub.io \ | ||
datascienceclusters.datasciencecluster.opendatahub.io | ||
oc delete operators rhods-operator.redhat-ods-operator | ||
oc delete operatorgroup -n redhat-ods-operator rhods-operator | ||
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator | ||
|
||
# Coscheduler uninstall | ||
helm uninstall -n scheduler-plugins scheduler-plugins | ||
oc delete namespace scheduler-plugins | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# Upgrading from RHOAI 2.14 | ||
|
||
These instructions assume you installed and configured RHOAI 2.14 following | ||
the MLBatch [install instructions for RHOAI-v2.14](../setup.RHOAI-v2.14/CLUSTER-SETUP.md) | ||
and are subscribed to the fast channel. | ||
|
||
Your subscription will have automatically created an unapproved | ||
install plan to upgrade to RHOAI 2.15. | ||
|
||
Before beginning, verify that the expected install plan exists: | ||
```sh | ||
oc get ip -n redhat-ods-operator | ||
``` | ||
Typical output would be: | ||
``` | ||
NAME CSV APPROVAL APPROVED | ||
install-kpzzl rhods-operator.2.15.0 Manual false | ||
install-nqrbp rhods-operator.2.14.0 Manual true | ||
``` | ||
|
||
Assuming the install plan exists, you can begin the upgrade process. | ||
|
||
There are no MLBatch modifications to the default RHOAI configuration maps | ||
beyond those already made in previous installs. Therefore, you can simply | ||
approve the install plan, replacing the example plan name below with the actual | ||
value on your cluster: | ||
```sh | ||
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl | ||
``` |
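The name of the pending install plan can also be extracted from the listing rather than copied by hand. This is a minimal sketch that assumes the table layout shown above, with the plan name in the first column and the approval status in the last; the function name `unapproved_install_plans` is hypothetical.

```sh
# Hypothetical helper (not part of MLBatch): print the names of install
# plans in redhat-ods-operator whose APPROVED column is "false".
unapproved_install_plans() {
  oc get ip -n redhat-ods-operator --no-headers |
    awk '$NF == "false" { print $1 }'
}
```

If exactly one name is printed, it can be passed to the `oc patch ip` command above in place of the example plan name. If several names are printed, inspect them before approving any of them.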
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
- op: add | ||
  path: /spec/template/spec/priorityClassName
  value: system-node-critical |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: ResourceFlavor | ||
metadata: | ||
  name: default-flavor |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: mlbatch-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Removed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: mlbatch-dsci
spec:
  applicationsNamespace: redhat-ods-applications
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring
  serviceMesh:
    managementState: Removed
  trustedCABundle:
    customCABundle: ""
    managementState: Managed |