Reorganize instructions
Signed-off-by: davidmirror-ops <[email protected]>
davidmirror-ops committed Jul 23, 2024
1 parent 88eedf4 commit 481c584
Showing 1 changed file with 29 additions and 21 deletions: docs/user_guide/productionizing/configuring_access_to_gpus.md
```{eval-rst}
.. tags:: Deployment, Infrastructure, GPU, Intermediate
```

Along with compute resources like CPU and memory, you may want to configure and access GPU resources.

Flyte provides different ways to request accelerator resources directly from the task decorator.

>The examples in this section use [ImageSpec](https://docs.flyte.org/en/latest/user_guide/customizing_dependencies/imagespec.html#imagespec), a Flyte feature that builds a custom container image without a Dockerfile. To use it, install the `envd` image builder with `pip install flytekitplugins-envd`.
## Requesting any available GPU device(s)
The goal here is to run the task on a single GPU device:

```python
import torch

from flytekit import ImageSpec, Resources, task

image = ImageSpec(
    # illustrative build arguments; adjust for your environment
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    registry="<YOUR_CONTAINER_REGISTRY>",
)


@task(requests=Resources(gpu="1"))
def gpu_available() -> bool:
    return torch.cuda.is_available()  # returns True if CUDA (provided by a GPU) is available
```
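To execute the task, you can wrap it in a workflow; a minimal sketch (the workflow name and file name are illustrative):

```python
from flytekit import workflow


@workflow
def check_gpu() -> bool:
    # Flyte schedules gpu_available on a node with a free GPU device
    return gpu_available()
```

You can then execute it on your cluster with `pyflyte run --remote gpu_example.py check_gpu` (file name assumed).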


### How it works

When this task is evaluated, `flytepropeller` injects a toleration into the pod spec:

```yaml
tolerations: nvidia.com/gpu:NoSchedule op=Exists
```
The Kubernetes scheduler will admit the pod if there are worker nodes in the cluster with a matching taint and available resources.
The resource `nvidia.com/gpu` key name is not arbitrary though. It corresponds to the [Extended Resource](https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/) that the Kubernetes worker nodes advertise to the API server through the [device plugin](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins).
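To see what your worker nodes actually advertise, here's a minimal sketch using the [kubernetes Python client](https://github.com/kubernetes-client/python) (an assumption of this example; install it with `pip install kubernetes` and point it at your cluster):

```python
from kubernetes import client, config

config.load_kube_config()  # reads your local kubeconfig

# print the extended GPU resource each node advertises to the API server
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu={gpus}")
```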

>NVIDIA maintains a [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) that automates the management of all software prerequisites on Kubernetes, including the device plugin.

``flytekit`` assumes by default that `nvidia.com/gpu` is the resource name for your GPUs. If your GPU accelerators expose a different resource name, adjust the following key in the Helm values file:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-resource-name: <YOUR_GPU_RESOURCE_NAME> # change to match what your nodes advertise
```

**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-resource-name: <YOUR_GPU_RESOURCE_NAME> # change to match what your nodes advertise
```

If your infrastructure requires additional tolerations for GPU-resourced tasks to be scheduled, also adjust the `resource-tolerations` section:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        resource-tolerations:
          - nvidia.com/gpu:
            - key: "mykey"
              operator: "Equal"
              value: "myvalue"
              effect: "NoSchedule"
```

**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        resource-tolerations:
          - nvidia.com/gpu:
            - key: "mykey"
              operator: "Equal"
              value: "myvalue"
              effect: "NoSchedule"
```
>For the above configuration, your worker nodes should have a `mykey=myvalue:NoSchedule` configured [taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
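If you manage taints manually, here's a sketch of applying the matching taint with the same `kubernetes` client (the node name is hypothetical, and note that this patch replaces any taints already on the node):

```python
from kubernetes import client, config

config.load_kube_config()

# taint a GPU worker node so only pods carrying the injected toleration land on it
# WARNING: this overwrites any existing taints on the node
client.CoreV1Api().patch_node(
    "gpu-node-1",  # hypothetical node name
    {"spec": {"taints": [{"key": "mykey", "value": "myvalue", "effect": "NoSchedule"}]}},
)
```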

## Requesting a specific GPU device

The goal here is to run the task on a specific type of accelerator, an NVIDIA Tesla V100 in the following example:


```python
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import V100
import torch

image = ImageSpec(
    # illustrative build arguments; adjust for your environment
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    registry="<YOUR_CONTAINER_REGISTRY>",
)


@task(requests=Resources(gpu="1"), accelerator=V100)  # request an NVIDIA Tesla V100
def gpu_available() -> bool:
    return torch.cuda.is_available()
```


### How it works

When this task is evaluated, `flytepropeller` injects both a toleration and a [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) for a more flexible scheduling configuration.

An example pod spec on GKE would include the following:

```yaml
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-v100
  tolerations:
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-v100
    effect: NoSchedule
```
### Configuring the nodeSelector
The `key` that the injected node selector uses corresponds to an arbitrary label that your Kubernetes worker nodes should already have. In the above example it's `cloud.google.com/gke-accelerator` but, depending on your cloud provider, it could be any other value. You can inform Flyte about the labels your worker nodes use by adjusting the Helm values:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator" # change to match your node's config
```

**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator" # change to match your node's config
```

While the `key` is arbitrary, the value (`nvidia-tesla-v100`) is not. `flytekit` has a set of predefined constants, and your node labels have to use one of those values.
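As a sketch of how those predefined constants are used from the task decorator (task names here are illustrative):

```python
from flytekit import Resources, task
from flytekit.extras.accelerators import A100, T4


# maps to a node label value of "nvidia-tesla-t4"
@task(requests=Resources(gpu="1"), accelerator=T4)
def preprocess() -> None:
    ...


# maps to a node label value of "nvidia-tesla-a100"
@task(requests=Resources(gpu="1"), accelerator=A100)
def train() -> None:
    ...
```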

## Requesting a GPU partition

`flytekit` supports [Multi-Instance GPU partitioning](https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/#mig_partitioning_and_gpu_instance_profiles) on NVIDIA A100 devices for optimal resource utilization.

Example:
```python
from flytekit import ImageSpec, Resources, task
import torch
from flytekit.extras.accelerators import A100

image = ImageSpec(
    # illustrative build arguments; adjust for your environment
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    registry="<YOUR_CONTAINER_REGISTRY>",
)


@task(requests=Resources(gpu="1"), accelerator=A100.partition_2g_10gb)  # a 2g.10gb MIG slice
def gpu_available() -> bool:
    return torch.cuda.is_available()
```
### How it works
In this case, ``flytepropeller`` injects an additional node selector expression into the resulting pod spec, indicating the partition size:

```yaml
spec:
  nodeSelector:
    nvidia.com/gpu.partition-size: "2g.10gb"
```

Plus an additional toleration:

```yaml
tolerations:
- key: nvidia.com/gpu.partition-size
  operator: Equal
  value: 2g.10gb
```
Consequently, your Kubernetes worker nodes should have matching labels so the Kubernetes scheduler can admit the pods:

Node labels (example):
```yaml
nvidia.com/gpu.partition-size: "2g.10gb"
nvidia.com/gpu.accelerator: "nvidia-tesla-a100"
```

If you want to better control scheduling, configure your worker nodes with taints that match the tolerations injected into the pods.

>NOTE: currently, ``flytekit`` supports partitioning only for NVIDIA A100 devices.

In the example above, the ``nvidia.com/gpu.partition-size`` key is arbitrary and can be controlled from the Helm chart:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-partition-size-node-label: "nvidia.com/gpu.partition-size" # change to match your node's config
```

**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-partition-size-node-label: "nvidia.com/gpu.partition-size" # change to match your node's config
```
The ``2g.10gb`` value comes from the [NVIDIA A100 supported instance profiles](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#concepts) and it's controlled from the task decorator (``accelerator=A100.partition_2g_10gb`` in the above example). Depending on the profile requested in the task, Flyte will inject the corresponding value for the node selector.
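Other profiles can be requested the same way; a short sketch (task names are illustrative, the profiles come from the ``flytekit`` A100 constants):

```python
from flytekit import Resources, task
from flytekit.extras.accelerators import A100


# smallest slice: 1 compute instance with 5GB of memory
@task(requests=Resources(gpu="1"), accelerator=A100.partition_1g_5gb)
def light_inference() -> None:
    ...


# largest profile: 7 compute instances with 40GB of memory
@task(requests=Resources(gpu="1"), accelerator=A100.partition_7g_40gb)
def heavy_training() -> None:
    ...
```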

>Learn more about the full list of ``flytekit`` supported partition profiles and task decorator options [here](https://docs.flyte.org/en/latest/api/flytekit/generated/flytekit.extras.accelerators.A100.html#flytekit.extras.accelerators.A100).
