pytorchjob-generator

An AppWrapper generator for PyTorchJobs

Overview

This file documents the variables that may be set in a user's settings.yaml to customize the Jobs generated by the tool.

Values

Job Metadata

Key	Type	Default	Description
jobName	string	must be provided by user	Name of the Job. Will be the name of the AppWrapper and the PyTorchJob.
namespace	string	`nil`	Namespace in which to run the Job. If unspecified, the namespace will be inferred using normal Helm/Kubernetes mechanisms when the Job is submitted.
queueName	string	`"default-queue"`	Name of the local queue to which the Job will be submitted.
priority	string	`"default-priority"`	Type of priority for the job (choose from: "default-priority", "low-priority" or "high-priority").
customLabels	array	`nil`	Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper).
containerImage	string	must be provided by user	Image used for creating the Job's containers (must contain all the applications your job needs).
imagePullSecrets	array	`nil`	List of image-pull-secrets to be used for pulling containerImages
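
As an illustration of the keys above, a minimal settings.yaml fragment might look like the following. The job name, namespace, and image reference are hypothetical placeholders, not values defined by the tool; only the key names and the queue and priority values come from the table.

```yaml
# Minimal sketch of Job Metadata settings (all values are illustrative).
jobName: my-training-job                 # hypothetical; required, names the AppWrapper and the PyTorchJob
namespace: my-namespace                  # hypothetical; omit to let Helm/Kubernetes infer it
queueName: default-queue                 # the documented default
priority: default-priority               # or low-priority / high-priority
containerImage: registry.example.com/team/pytorch:latest   # hypothetical image reference; required
```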

Resource Requirements

Key	Type	Default	Description
numPods	integer	`1`	Total number of pods (i.e. master + worker pods) to be created
numCpusPerPod	integer or string	`1`	Number of CPUs for each pod. May be a positive integer or a ResourceQuantity (e.g. 500m).
numGpusPerPod	integer	`0`	Number of GPUs for each pod (all GPUs per node is currently recommended for distributed training).
totalMemoryPerPod	string	`"1Gi"`	Total memory for each pod expressed as a ResourceQuantity (e.g. 1Gi, 200M, etc.).
limitCpusPerPod	integer or string	numCpusPerPod	Limit on the number of CPUs per pod for elastic jobs. May be a positive integer or a ResourceQuantity (e.g. 500m).
limitGpusPerPod	integer	numGpusPerPod	Limit on the number of GPUs per pod for elastic jobs.
limitMemoryPerPod	string	totalMemoryPerPod	Limit on the total memory per pod for elastic jobs (e.g. 1Gi, 200M, etc.).
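
A sketch of how these keys could be combined for a small multi-pod job is shown below. The specific numbers are assumptions chosen for illustration and should be adapted to the nodes available on your cluster.

```yaml
# Illustrative resource settings (numbers are hypothetical).
numPods: 4                # total pods, i.e. 1 master + 3 workers
numCpusPerPod: 8          # a positive integer, or a ResourceQuantity such as 500m
numGpusPerPod: 8          # hypothetical GPU count per pod
totalMemoryPerPod: 32Gi   # a ResourceQuantity
# The limit* keys are left out here, so per the table each limit
# defaults to the corresponding request value above.
```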

Workload Specification

Key	Type	Default	Description
environmentVariables	array	`nil`	List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See values.yaml for examples of supported syntaxes. NOTE: The following standard PyTorch Distributed environment variables are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT.
sshGitCloneConfig	object	`nil`	Private GitHub clone support. See values.yaml for additional instructions.
setupCommands	array	no custom commands are executed	List of custom commands to be run at the beginning of the execution. Use `setupCommands` to clone code, download data, and change directories.
mainProgram	string	`nil`	Name of the PyTorch program to be executed by `torchrun`. Please provide your program name here and NOT in "setupCommands" as this helm template provides the necessary "torchrun" arguments for the parallel execution. WARNING: this program is relative to the current path set by change-of-directory commands in "setupCommands". If no value is provided, then only `setupCommands` are executed and torchrun is elided.
volumes	array	No volumes are mounted	List of "(name, claimName, mountPath)" entries for volumes, each backed by a persistentVolumeClaim, to be mounted for the Job.
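
The fragment below sketches how `setupCommands`, `mainProgram`, and `volumes` might fit together. The repository URL, program name, claim name, and mount path are hypothetical, and the assumption that each command is a plain string should be checked against values.yaml.

```yaml
# Illustrative workload settings (URL, names, and paths are hypothetical).
setupCommands:
  - git clone https://github.com/example-org/example-repo   # hypothetical repository
  - cd example-repo                                         # mainProgram is resolved relative to this directory
mainProgram: train.py                                       # hypothetical; run via torchrun by the template
volumes:
  - name: training-data              # arbitrary volume name
    claimName: my-data-pvc           # hypothetical; must match an existing PersistentVolumeClaim
    mountPath: /workspace/data       # hypothetical path inside the container
```

Because the template supplies the `torchrun` arguments, the program appears only in `mainProgram` and not as a command in `setupCommands`.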

Advanced Options

Key	Type	Default	Description
roceGdrResName	string	nvidia.com/roce_gdr	RoCE GDR resource name (can vary by cluster configuration)
numRoceGdr	integer	`0`	Number of nvidia.com/roce_gdr resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1.
topologyFileConfigMap	string	`nil`	Name of configmap containing /var/run/nvidia-topologyd/virtualTopology.xml for the system e.g. nvidia-topo-gdr
ncclGdrEnvConfigMap	string	`nil`	Name of configmap containing NCCL networking environment variables for the system e.g. nccl-netwk-env-vars
multiNicNetworkName	string	`nil`	Name of multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-nic network instance should be specified here instead of the TCP multi-nic network instance. Existing instance names can be listed with `oc get multinicnetwork`.
disableSharedMemory	boolean	`false`	Control whether or not a shared memory volume is added to the PyTorchJob.
mountNVMe	object	`nil`	Mount NVMe as a volume. The environment variable MOUNT_PATH_NVME provides the runtime mount path
initContainers	array	`nil`	List of "(name, image, command[])" entries specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes entry on initContainers for reference.
autopilotHealthChecks	array	No pre-flight checks are enabled.	Autopilot health checks. List of labels enabling one or more system health pre-flight checks.
hostIgnoreList	array	`nil`	List of host names on which the Job must not be scheduled (to avoid faulty nodes).
schedulerName	string	`nil`	If non-nil, use the specified Kubernetes scheduler. Setting this to the default-scheduler may result in GPU fragmentation on the cluster. Setting this to any non-nil value should only be done when explicitly directed to do so by a cluster admin!
serviceAccountName	string	the default service account for the namespace will be used.	Service account to be used for running the Job
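
As a rough sketch, enabling GDR over RoCE for a multi-pod job and adding one init container might look like this. The configmap names are the examples given in the table; the network name, host name, and init container details are hypothetical, and the resource count is an arbitrary illustration.

```yaml
# Illustrative advanced settings (cluster-specific names are hypothetical).
numPods: 2                                # numRoceGdr must be 0 unless numPods > 1
roceGdrResName: nvidia.com/roce_gdr       # the documented default; can vary by cluster
numRoceGdr: 2                             # hypothetical count; >0 enables GDR over RoCE
topologyFileConfigMap: nvidia-topo-gdr    # example name from the table
ncclGdrEnvConfigMap: nccl-netwk-env-vars  # example name from the table
multiNicNetworkName: my-roce-network      # hypothetical; list real names with `oc get multinicnetwork`
hostIgnoreList:
  - faulty-node-01.example.com            # hypothetical host to avoid
initContainers:
  - name: init-check                      # hypothetical init container
    image: busybox
    command: ["sh", "-c", "echo preparing"]
```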

Fault Tolerance

Key	Type	Default	Description
admissionGracePeriodDuration	string	The AppWrapper defaults will be used	Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
warmupGracePeriodDuration	string	The AppWrapper defaults will be used	Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
failureGracePeriodDuration	string	The AppWrapper defaults will be used	Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
retryPausePeriodDuration	string	The AppWrapper defaults will be used	Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
retryLimit	integer	The AppWrapper defaults will be used	Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
forcefulDeletionGracePeriodDuration	string	The AppWrapper defaults will be used	Customize the forcefulDeletionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
deletionOnFailureGracePeriodDuration	string	The AppWrapper defaults will be used	Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
restartPolicy	string	`"Never"`	Set Kubernetes policy for restarting failed containers "in place" (without restarting the Pod).
terminationGracePeriodSeconds	integer	Kubernetes's default value is used	Set a non-default pod termination grace period (in seconds).
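
A possible fault-tolerance configuration is sketched below. It assumes the duration keys accept duration strings such as `5m` or `90s`, which should be confirmed against the AppWrapper fault-tolerance documentation linked in the table; every value shown is hypothetical.

```yaml
# Illustrative fault-tolerance settings (values and duration format are assumptions).
retryLimit: 3                        # hypothetical retry count
retryPausePeriodDuration: 90s        # assumed duration-string format
failureGracePeriodDuration: 5m       # assumed duration-string format
restartPolicy: Never                 # the documented default
terminationGracePeriodSeconds: 60    # hypothetical; an integer number of seconds
```

Any key left unset falls back to the AppWrapper or Kubernetes default noted in the table.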
