PyTorchJob Generator

The Helm chart defined in this folder facilitates the configuration of PyTorch jobs for submission to an OpenShift cluster implementing MLBatch.

Invocations of this chart generate a PyTorchJob wrapped in an AppWrapper for better traceability and fault tolerance.

Obtaining the Chart

To start with, add the mlbatch Helm chart repository.

helm repo add mlbatch https://project-codeflare.github.io/mlbatch
helm repo update

To verify that the repository was added correctly, search for AppWrapper.

helm search repo AppWrapper

You should see output similar to the following:

NAME                        	CHART VERSION	APP VERSION	DESCRIPTION
mlbatch/pytorchjob-generator	1.1.6        	v1beta2    	An AppWrapper generator for PyTorchJobs

Configuring the Job

Create a settings.yaml file with the settings for the PyTorch job, for example:

jobName: my-job               # name of the generated AppWrapper and PyTorchJob objects (required)
queueName: default-queue      # local queue to submit to (default: default-queue)

numPods: 4                    # total pod count including master and worker pods (default: 1)
numCpusPerPod: 500m           # requested number of cpus per pod (default: 1)
numGpusPerPod: 8              # requested number of gpus per pod (default: 0)
totalMemoryPerPod: 1Gi        # requested amount of memory per pod (default: 1Gi)

priority: default-priority    # default-priority (default), low-priority, or high-priority

# container image for the pods (required)
containerImage: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126

# setup commands to run in each pod (optional)
setupCommands:
- git clone https://github.com/dbarnett/python-helloworld
- cd python-helloworld

# main program to invoke via torchrun (optional)
mainProgram: helloworld.py

To learn more about the available settings, see chart/README.md.
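
Because jobName and containerImage are marked as required in the example above, it can help to check for them before rendering the chart. The following is a minimal sketch and not part of the chart: check_settings is a hypothetical helper that only looks for top-level keys with a non-empty value and does not parse YAML properly.

```shell
# check_settings FILE: report any setting marked "(required)" that is missing.
# Hypothetical helper for illustration; the chart itself does not ship it.
check_settings() {
  file="$1"
  status=0
  for key in jobName containerImage; do
    # Look for a top-level "key:" followed by a value that is not just a comment.
    if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]#]" "$file"; then
      echo "missing required setting: ${key}" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running check_settings settings.yaml returns a non-zero status and names each missing key on standard error, so it can be chained in front of the submission command.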

Submitting the Job

To submit the PyTorch job to the cluster using the settings.yaml file, run:

helm template -f settings.yaml mlbatch/pytorchjob-generator | oc create -f-

To optionally capture the generated AppWrapper specification as a generated.yaml file, run instead:

helm template -f settings.yaml mlbatch/pytorchjob-generator | tee generated.yaml | oc create -f-
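
In a shell pipeline, oc create still runs even when helm template fails. One way to avoid that is to render to a file first and submit only if rendering succeeded. The following is a sketch under that assumption: render_and_submit is a hypothetical wrapper, and it uses only the helm template and oc create invocations shown above, with oc create reading the file directly instead of standard input.

```shell
# render_and_submit SETTINGS [OUTPUT]: render the chart to a file, then submit it.
# Hypothetical wrapper for illustration; OUTPUT defaults to generated.yaml.
render_and_submit() {
  settings="$1"
  output="${2:-generated.yaml}"
  # Stop before touching the cluster if rendering fails.
  helm template -f "$settings" mlbatch/pytorchjob-generator > "$output" || return 1
  oc create -f "$output"
}
```

Calling render_and_submit settings.yaml leaves the rendered AppWrapper in generated.yaml, which can be inspected whether or not the submission went through.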

To remove the PyTorch job from the cluster, delete the generated AppWrapper object:

oc delete appwrapper my-job
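
The name passed to oc delete is the jobName from settings.yaml, so it can be read from the file rather than typed by hand. The following is a rough sketch: job_name_from_settings and delete_job are hypothetical helpers, and the sed expression assumes an unquoted top-level jobName value like the one in the example settings.

```shell
# job_name_from_settings FILE: print the top-level jobName value,
# dropping any trailing comment. Hypothetical helper; not a YAML parser.
job_name_from_settings() {
  sed -n 's/^jobName:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1"
}

# delete_job FILE: delete the AppWrapper named by jobName in FILE.
delete_job() {
  name=$(job_name_from_settings "$1")
  [ -n "$name" ] || { echo "no jobName found in $1" >&2; return 1; }
  oc delete appwrapper "$name"
}
```

With the example settings, delete_job settings.yaml runs the same oc delete appwrapper my-job command shown above.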
