Commit 2f9cc5e: Expand 2nd section up to nodeSelector key

Signed-off-by: davidmirror-ops <[email protected]>
davidmirror-ops committed Jun 25, 2024
1 parent fe51590 commit 2f9cc5e
Showing 1 changed file with 79 additions and 3 deletions: docs/user_guide/productionizing/configuring_access_to_gpus.md
The goal here is to make a simple request of any available GPU device(s).

![](https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/gpus/generic_gpu_access.png)

When this task is evaluated, `flytepropeller` injects a [toleration](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) in the pod spec:

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```
The Kubernetes scheduler will admit the pod if there are worker nodes with matching taints and available resources in the cluster.
The `nvidia.com/gpu` resource key name is not arbitrary. It corresponds to the [Extended Resource](https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/) that the Kubernetes worker nodes advertise to the API server through the [device plugin](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins).
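You can confirm that a worker node advertises this Extended Resource by inspecting its allocatable resources; a sketch of the check (the node name is a placeholder) could be:

```shell
# Prints the number of GPUs the node advertises to the API server
kubectl get node <your-gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```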


## Requesting a specific GPU device

Example:
```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import V100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="envd",
    registry="<YOUR_CONTAINER_REGISTRY>",
)

@task(
    requests=Resources(gpu="1"),
    accelerator=V100,  # NVIDIA Tesla V100
    container_image=image,
)
def gpu_available() -> bool:
    return torch.cuda.is_available()
```
Leveraging flytekit's accelerator support, you can specify the target GPU device directly in the task decorator.

### How it works

When this task is evaluated, `flytepropeller` injects both a toleration and a nodeSelector for a more flexible scheduling configuration.

An example pod spec on GKE would include the following:

```yaml
apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-v100
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-v100
    effect: NoSchedule
```
### Configuring the nodeSelector
The `key` that the injected node selector uses corresponds to an arbitrary label that your Kubernetes worker nodes should carry. In the above example it's `cloud.google.com/gke-accelerator`, but depending on your cloud provider it could be any other value. You can inform Flyte about the label your worker nodes use by adjusting the Helm values:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator" # change to match your nodes' config
```
**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator" # change to match your nodes' config
```
While the `key` is arbitrary, the `value` is not. flytekit has a set of [predefined](https://docs.flyte.org/en/latest/api/flytekit/extras.accelerators.html#predefined-accelerator-constants) constants, and your node label has to use one of those values.
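On managed services like GKE this label is applied automatically; on a self-managed cluster you might apply it yourself. A sketch (the node name is a placeholder, and the value is one of flytekit's predefined constants):

```shell
# Label a worker node so the injected node selector can match it
kubectl label node <your-gpu-node> cloud.google.com/gke-accelerator=nvidia-tesla-v100
```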

## Requesting a GPU partition

Flyte
