The Helm chart defined in this folder facilitates the configuration of PyTorch jobs for submission to an OpenShift cluster implementing MLBatch.
Invocations of this chart generate a PyTorchJob wrapped into an AppWrapper for better traceability and fault-tolerance.
To start with, add the mlbatch Helm chart repository.
helm repo add mlbatch https://project-codeflare.github.io/mlbatch
helm repo update
To verify that the chart repository was added correctly, search for AppWrapper.
helm search repo AppWrapper
You should see output similar to the following:
NAME                            CHART VERSION   APP VERSION     DESCRIPTION
mlbatch/pytorchjob-generator    1.1.6           v1beta2         An AppWrapper generator for PyTorchJobs
Create a settings.yaml file with the settings for the PyTorch job, for example:
jobName: my-job # name of the generated AppWrapper and PyTorchJob objects (required)
queueName: default-queue # local queue to submit to (default: default-queue)
numPods: 4 # total pod count including master and worker pods (default: 1)
numCpusPerPod: 500m # requested number of cpus per pod (default: 1)
numGpusPerPod: 8 # requested number of gpus per pod (default: 0)
totalMemoryPerPod: 1Gi # requested amount of memory per pod (default: 1Gi)
priority: default-priority # default-priority (default), low-priority, or high-priority
# container image for the pods (required)
containerImage: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
# setup commands to run in each pod (optional)
setupCommands:
- git clone https://github.com/dbarnett/python-helloworld
- cd python-helloworld
# main program to invoke via torchrun (optional)
mainProgram: helloworld.py
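As a smaller starting point, the sketch below keeps only the two settings marked as required in the example above. It assumes every omitted setting falls back to the default noted in those comments. The job name is a hypothetical placeholder; the image is the one from the example.

```yaml
# Minimal settings.yaml sketch: only the settings marked (required) above.
jobName: my-minimal-job # hypothetical name, pick your own
containerImage: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126
```

With neither setupCommands nor mainProgram set, what the pods actually run depends on the chart and the image, so check chart/README.md before relying on this form.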
To learn more about the available settings, see chart/README.md.
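Before submitting, it can help to confirm that the required settings are present. The helper below is a hypothetical sketch, not part of the chart: `check_settings` is an invented name, and it only looks for non-empty top-level `jobName` and `containerImage` keys with `grep`, without parsing or validating the YAML.

```shell
# Hypothetical pre-flight helper (not provided by the chart).
# Returns non-zero if a required top-level setting is missing or empty.
check_settings() {
  file="$1"
  for key in jobName containerImage; do
    # Require "key:" at the start of a line, followed by a value
    # that is not just whitespace or a comment.
    if ! grep -q "^${key}:[[:space:]]*[^[:space:]#]" "$file"; then
      echo "missing required setting: ${key}" >&2
      return 1
    fi
  done
}

# Usage: check_settings settings.yaml && echo "required settings present"
```

Because the function returns a non-zero status on failure, it can be chained with `&&` in front of the submission command so that an incomplete file is never sent to the cluster.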
To submit the PyTorch job to the cluster using the settings.yaml file, run:
helm template -f settings.yaml mlbatch/pytorchjob-generator | oc create -f-
To optionally capture the generated AppWrapper specification as a generated.yaml file, run instead:
helm template -f settings.yaml mlbatch/pytorchjob-generator | tee generated.yaml | oc create -f-
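Once generated.yaml exists, a quick look at the resource kinds it contains can confirm that the PyTorchJob was wrapped as expected. The helper below is a hypothetical sketch (`list_kinds` is an invented name) and assumes the rendered manifest carries ordinary Kubernetes `kind:` fields for both the AppWrapper and the wrapped PyTorchJob. It matches lines textually rather than parsing the YAML.

```shell
# Hypothetical helper: print the distinct values of all "kind:" fields
# found in a rendered manifest, one per line, sorted.
list_kinds() {
  grep -E '^[[:space:]-]*kind:' "$1" \
    | sed -E 's/^[[:space:]-]*kind:[[:space:]]*//' \
    | sort -u
}

# Usage: list_kinds generated.yaml
```

If the output does not include both AppWrapper and PyTorchJob, the settings file or chart version is worth re-checking before debugging on the cluster.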
To remove the PyTorch job from the cluster, delete the generated AppWrapper object:
oc delete appwrapper my-job