pytorchjob-generator

An AppWrapper generator for PyTorchJobs

Version: 1.1.6 Type: application AppVersion: v1beta2

Overview

This file documents the variables that may be set in a user's settings.yaml to customize the Jobs generated by the tool.

Values

Job Metadata

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| jobName | string | must be provided by user | Name of the Job. Will be the name of the AppWrapper and the PyTorchJob. |
| namespace | string | nil | Namespace in which to run the Job. If unspecified, the namespace will be inferred using normal Helm/Kubernetes mechanisms when the Job is submitted. |
| queueName | string | "default-queue" | Name of the local queue to which the Job will be submitted. |
| priority | string | "default-priority" | Type of priority for the job (choose from: "default-priority", "low-priority" or "high-priority"). |
| customLabels | array | nil | Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper). |
| containerImage | string | must be provided by user | Image used for creating the Job's containers (needs to have all the applications your job may need). |
| imagePullSecrets | array | nil | List of image pull secrets to be used for pulling the containerImage. |
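
As a minimal sketch of how the keys above might appear in a settings.yaml, assuming they are written as flat top-level keys named exactly as in the table. The job name, namespace, and image reference are hypothetical placeholders, not values from this project.

```yaml
# Hypothetical settings.yaml fragment; every value here is illustrative.
jobName: my-training-job        # required; also names the AppWrapper and the PyTorchJob
namespace: my-team-namespace    # optional; omit to let Helm/Kubernetes infer it
queueName: default-queue        # the documented default, shown for clarity
priority: default-priority      # or low-priority / high-priority
containerImage: registry.example.com/my-team/pytorch-train:latest  # required; placeholder image
```

Only jobName and containerImage have no default, so a settings file that sets just those two should be the smallest valid starting point.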

Resource Requirements

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| numPods | integer | 1 | Total number of pods (i.e. master + worker pods) to be created. |
| numCpusPerPod | integer or string | 1 | Number of CPUs for each pod. May be a positive integer or a ResourceQuantity (e.g. 500m). |
| numGpusPerPod | integer | 0 | Number of GPUs for each pod (all GPUs per node is currently recommended for distributed training). |
| totalMemoryPerPod | string | "1Gi" | Total memory for each pod expressed as a ResourceQuantity (e.g. 1Gi, 200M, etc.). |
| limitCpusPerPod | integer or string | numCpusPerPod | Limit on the number of CPUs per pod for elastic jobs. May be a positive integer or a ResourceQuantity (e.g. 500m). |
| limitGpusPerPod | integer | numGpusPerPod | Limit on the number of GPUs per pod for elastic jobs. |
| limitMemoryPerPod | string | totalMemoryPerPod | Limit on the total memory per pod for elastic jobs (e.g. 1Gi, 200M, etc.). |
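
For illustration, a resource request for a small distributed job could look like the fragment below. The sizes are assumptions chosen for the example (including the guess that a node has 8 GPUs); adjust them to your cluster.

```yaml
# Hypothetical two-pod job; sizes are illustrative only.
numPods: 2               # total pods: 1 master + 1 worker
numCpusPerPod: 8         # an integer, or a ResourceQuantity such as 500m
numGpusPerPod: 8         # assumes 8-GPU nodes, following the "all GPUs per node" advice
totalMemoryPerPod: 64Gi  # a ResourceQuantity
# limitCpusPerPod, limitGpusPerPod, and limitMemoryPerPod are left unset,
# so each defaults to the corresponding request value above.
```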

Workload Specification

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| environmentVariables | array | nil | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets. See values.yaml for examples of supported syntaxes. NOTE: The following standard PyTorch Distributed environment variables are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| sshGitCloneConfig | object | nil | Private GitHub clone support. See values.yaml for additional instructions. |
| setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use setupCommands to clone code, download data, and change directories. |
| mainProgram | string | nil | Name of the PyTorch program to be executed by torchrun. Please provide your program name here and NOT in "setupCommands", as this Helm template provides the necessary "torchrun" arguments for the parallel execution. WARNING: the program path is relative to the current directory set by change-of-directory commands in "setupCommands". If no value is provided, only setupCommands are executed and torchrun is elided. |
| volumes | array | no volumes are mounted | List of "(name, claimName, mountPath)" entries describing volumes, each backed by a persistentVolumeClaim, to be mounted into the Job's pods. |
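
The relationship between setupCommands, mainProgram, and volumes can be sketched as follows. The repository URL, directory, program name, and claim name are hypothetical; the volume field names follow the "(name, claimName, mountPath)" description above.

```yaml
# Hypothetical workload fragment; names and paths are placeholders.
setupCommands:
  - git clone https://github.com/example-org/example-training.git  # placeholder repository
  - cd example-training            # mainProgram is resolved relative to this directory
mainProgram: train.py              # launched by torchrun; do not invoke torchrun yourself
volumes:
  - name: training-data
    claimName: my-data-pvc         # an existing PersistentVolumeClaim (assumed to exist)
    mountPath: /data
```

Because the template supplies the torchrun arguments, the training script can read WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT from the environment without declaring them under environmentVariables.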

Advanced Options

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration). |
| numRoceGdr | integer | 0 | Number of nvidia.com/roce_gdr resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1. |
| topologyFileConfigMap | string | nil | Name of the configmap containing /var/run/nvidia-topologyd/virtualTopology.xml for the system, e.g. nvidia-topo-gdr. |
| ncclGdrEnvConfigMap | string | nil | Name of the configmap containing NCCL networking environment variables for the system, e.g. nccl-netwk-env-vars. |
| multiNicNetworkName | string | nil | Name of the multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-NIC network instance should be specified here instead of the TCP multi-NIC network instance. Existing instance names can be listed with oc get multinicnetwork. |
| disableSharedMemory | boolean | false | Control whether or not a shared memory volume is added to the PyTorchJob. |
| mountNVMe | object | nil | Mount NVMe as a volume. The environment variable MOUNT_PATH_NVME provides the runtime mount path. |
| initContainers | array | nil | List of "(name, image, command[])" entries specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes entry on initContainers for reference. |
| autopilotHealthChecks | array | no pre-flight checks are enabled | Autopilot health checks. List of labels enabling one or more system health pre-flight checks. |
| hostIgnoreList | array | nil | List of host names on which the Job must not be scheduled (to avoid faulty nodes). |
| schedulerName | string | nil | If non-nil, use the specified Kubernetes scheduler. Setting this to the default-scheduler may result in GPU fragmentation on the cluster. Setting this to any non-nil value should only be done when explicitly directed to do so by a cluster admin! |
| serviceAccountName | string | the default service account for the namespace will be used | Service account to be used for running the Job. |
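
As a hedged sketch of two of the simpler advanced options, the fragment below adds one init container and excludes two nodes. The field names for initContainers follow the "(name, image, command[])" description above; the container name, command, and host names are hypothetical.

```yaml
# Hypothetical advanced-options fragment; values are placeholders.
initContainers:
  - name: preflight
    image: busybox
    command: ["sh", "-c", "echo init complete"]  # runs before the main job starts
hostIgnoreList:
  - node-a.example.com   # placeholder host names of nodes to avoid
  - node-b.example.com
```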

Fault Tolerance

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| admissionGracePeriodDuration | string | the AppWrapper defaults will be used | Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| warmupGracePeriodDuration | string | the AppWrapper defaults will be used | Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| failureGracePeriodDuration | string | the AppWrapper defaults will be used | Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryPausePeriodDuration | string | the AppWrapper defaults will be used | Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| retryLimit | integer | the AppWrapper defaults will be used | Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| forcefulDeletionGracePeriodDuration | string | the AppWrapper defaults will be used | Customize the forcefulDeletionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| deletionOnFailureGracePeriodDuration | string | the AppWrapper defaults will be used | Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
| restartPolicy | string | "Never" | Set the Kubernetes policy for restarting failed containers "in place" (without restarting the Pod). |
| terminationGracePeriodSeconds | integer | the Kubernetes default value is used | Set a non-default pod termination grace period (in seconds). |
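
A fault-tolerance override might be written as in the sketch below. The specific numbers are illustrative, and the duration strings assume a Go-style format such as "2m" or "90s"; confirm the accepted format against the AppWrapper fault-tolerance page linked in the table.

```yaml
# Hypothetical fault-tolerance overrides; unset keys keep the AppWrapper defaults.
retryLimit: 3
failureGracePeriodDuration: "2m"    # assumed Go-style duration string
retryPausePeriodDuration: "90s"     # assumed Go-style duration string
restartPolicy: Never                # the documented default, shown for clarity
terminationGracePeriodSeconds: 60   # plain integer number of seconds
```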