Enforce that numRoceGdr must be 0 unless numPods > 1 (#63)
dgrove-oss authored Sep 17, 2024
1 parent 57eb87a commit e9db365
Showing 3 changed files with 14 additions and 3 deletions.
2 changes: 1 addition & 1 deletion tools/pytorchjob-generator/chart/README.md
@@ -50,7 +50,7 @@ customize the Jobs generated by the tool.
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration) |
-| numRoceGdr | integer | `0` | number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE). |
+| numRoceGdr | integer | `0` | number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1. |
| topologyFileConfigMap | string | `nil` | Name of configmap containining /var/run/nvidia-topologyd/virtualTopology.xml for the system e.g. nvidia-topo-gdr |
| ncclGdrEnvConfigMap | string | `nil` | Name of configmap containing NCCL networking environment variables for the system e.g. nccl-netwk-env-vars |
| multiNicNetworkName | string | `nil` | Name of multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-nic network instance should be specified here instead of the TCP multi-nic network instance. Existing instance names can be listed with `oc get multinicnetwork`. |
13 changes: 12 additions & 1 deletion tools/pytorchjob-generator/chart/values.schema.json
@@ -79,7 +79,7 @@
{ "type": "null" },
{ "type": "string" }
]},
-"numRoceGdr": { "type": "integer" },
+"numRoceGdr": { "type": "integer", "minimum": 0, "maximum": 2 },
"topologyFileConfigMap": { "oneOf": [
{ "type": "null" },
{ "$ref": "#/$defs/rfc1123Label" }
@@ -134,6 +134,17 @@
"deletionOnFailureGracePeriodDuration" : { "$ref": "#/$defs/duration" }
},

+"if": {
+  "properties": {
+    "numPods": { "const": 1 }
+  }
+},
+"then": {
+  "properties": {
+    "numRoceGdr": { "const": 0 }
+  }
+},

"$defs": {
"rfc1123Label": {
"type": "string",
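The schema change above combines a range check on `numRoceGdr` with an `if`/`then` rule keyed on `numPods`. As a rough illustration of what that rule accepts and rejects, here is a simplified Python sketch of the same logic. The function name `check_roce_gdr` and the error strings are hypothetical and not part of the chart; the sketch only approximates JSON Schema semantics and is not how Helm performs validation.

```python
from __future__ import annotations

from typing import Any, Mapping


def check_roce_gdr(values: Mapping[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the values pass.

    Hypothetical helper: a simplified mirror of the schema rules in this diff.
    """
    errors: list[str] = []

    # Mirrors: "numRoceGdr": { "type": "integer", "minimum": 0, "maximum": 2 }
    if "numRoceGdr" in values:
        n = values["numRoceGdr"]
        if isinstance(n, bool) or not isinstance(n, int):
            errors.append("numRoceGdr must be an integer")
        elif not 0 <= n <= 2:
            errors.append("numRoceGdr must be between 0 and 2")

    # Mirrors: "if": { "properties": { "numPods": { "const": 1 } } }
    # In JSON Schema, "properties" only constrains keys that are present,
    # so a missing numPods also satisfies the "if" branch.
    if_matches = "numPods" not in values or values["numPods"] == 1

    # Mirrors: "then": { "properties": { "numRoceGdr": { "const": 0 } } }
    # A missing numRoceGdr passes for the same reason.
    if if_matches and "numRoceGdr" in values and values["numRoceGdr"] != 0:
        errors.append("numRoceGdr must be 0 unless numPods > 1")

    return errors


if __name__ == "__main__":
    for candidate in (
        {"numPods": 1, "numRoceGdr": 0},
        {"numPods": 2, "numRoceGdr": 2},
        {"numPods": 1, "numRoceGdr": 1},
        {"numPods": 4, "numRoceGdr": 3},
    ):
        print(candidate, check_roce_gdr(candidate))
```

Note that the `if` clause tests only for `numPods` equal to 1, so the rule as written is "must be 0 when numPods is 1" rather than a literal `numPods > 1` comparison. Whether `numPods` can actually be absent at validation time depends on the chart's defaults, which this diff does not show.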
2 changes: 1 addition & 1 deletion tools/pytorchjob-generator/chart/values.yaml
@@ -157,7 +157,7 @@ volumes:
# @section -- Advanced Options
roceGdrResName: # <optional, default="">

-# -- (integer) number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE).
+# -- (integer) number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1.
# @section -- Advanced Options
numRoceGdr: 0
