
Update GPU docs #5515

Merged · davidmirror-ops merged 10 commits into flyteorg:master on Jul 25, 2024
Conversation

davidmirror-ops (Contributor) commented Jun 26, 2024

The purpose is to bring in the information contained in the original flytekit accelerators PR (#4172) and complement it with learnings from testing the feature in a live environment with access to GPU devices.
The "How it works" section for each use case should speak to the platform engineers tasked with preparing the infrastructure so Flyte users can request accelerators from Python decorators.

How was this patch tested?

Tested on AKS using both V100 and A100 GPUs.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: davidmirror-ops <[email protected]>

codecov bot commented Jun 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 35.91%. Comparing base (de415af) to head (13c5e8f).
Report is 185 commits behind head on master.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           master    #5515       +/-   ##
===========================================
- Coverage   61.00%   35.91%   -25.10%
===========================================
  Files         793     1301      +508
  Lines       51378   109401    +58023
===========================================
+ Hits        31342    39287     +7945
- Misses      17149    66017    +48868
- Partials     2887     4097     +1210
```
| Flag | Coverage | Δ |
| --- | --- | --- |
| unittests-datacatalog | 51.37% <ø> | (-17.95%) ⬇️ |
| unittests-flyteadmin | 53.73% <ø> | (-4.98%) ⬇️ |
| unittests-flytecopilot | 12.17% <ø> | (-5.62%) ⬇️ |
| unittests-flytectl | 62.28% <ø> | (-5.77%) ⬇️ |
| unittests-flyteidl | 7.09% <ø> | (-71.96%) ⬇️ |
| unittests-flyteplugins | 53.31% <ø> | (-8.53%) ⬇️ |
| unittests-flytepropeller | 41.75% <ø> | (-15.56%) ⬇️ |
| unittests-flytestdlib | 55.27% <ø> | (-10.56%) ⬇️ |

Flags with carried forward coverage won't be shown.


Signed-off-by: davidmirror-ops <[email protected]>
davidmirror-ops marked this pull request as ready for review July 23, 2024 21:18
Comment on lines 21 to 28
```python
image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="envd",
    registry="<YOUR_CONTAINER_REGISTRY>",
)
```
Member commented:

I do not think this works with the envd image builder. This should work, though:

```python
image = ImageSpec(
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="default",
    registry=...,
)
```
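For completeness, a task would then reference the built image through the standard flytekit decorator argument (a small sketch; the task name and body are illustrative):

```python
from flytekit import task

@task(container_image=image)  # use the ImageSpec defined above
def hello_pytorch() -> str:
    import torch
    return torch.__version__  # confirms torch is importable inside the custom image
```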

```yaml
tolerations: nvidia.com/gpu:NoSchedule op=Exists
```
The Kubernetes scheduler will admit the pod if there are worker nodes in the cluster with a matching taint and available resources.
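For illustration, the shorthand above corresponds to this standard pod-spec toleration, with a matching node taint shown in the comment (the node name is hypothetical):

```yaml
# Pod-spec form of the shorthand toleration above
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
# A matching taint on a GPU worker node could be applied with:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
```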
Member commented:

When an accelerator is not specified and there are multiple GPU accelerator types configured, how does Kubernetes decide which accelerator to use?

davidmirror-ops (author) commented:

Great question. Currently, K8s uses third-party device driver plugins that advertise Extended Resources (that is, things other than CPU and memory) to the kubelet. In the case of NVIDIA, their GPU Operator implements this approach using a DaemonSet that communicates with the K8s scheduler to let it know which devices are available. While it works, it's not great: there's a lot of coordination required between the K8s scheduler and this external driver, no good management of "distributed locking" for shared resources, and in consequence, Pods can be scheduled to a node but left in a Pending state because one of those Extended Resources is not ready.
This is changing with the introduction of Dynamic Resource Allocation (soon to be in beta in K8s 1.31).

I just added a brief note about this but let me know if you think a better explanation is needed.
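To make the extended-resource mechanism concrete, this is the standard way a pod requests a device advertised by the NVIDIA device plugin (a generic Kubernetes sketch; the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-check               # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # extended resource advertised by the device plugin
```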

neverett previously approved these changes Jul 24, 2024
neverett (Contributor) left a comment:

Left some suggestions, which you can take or leave as you see fit; otherwise LGTM.


You can also configure Flyte backend to apply specific tolerations. This configuration is controlled under generic k8s plugin configuration as can be found [here](https://github.com/flyteorg/flyteplugins/blob/5a00b19d88b93f9636410a41f81a73356a711482/go/tasks/pluginmachinery/flytek8s/config/config.go#L120).
>The examples in this section use [ImageSpec](https://docs.flyte.org/en/latest/user_guide/customizing_dependencies/imagespec.html#imagespec), a Flyte feature that builds a custom container image without a Dockerfile. Install it using `pip install flytekitplugins-envd`.
Contributor commented:

Suggested change:
```diff
- >The examples in this section use [ImageSpec](https://docs.flyte.org/en/latest/user_guide/customizing_dependencies/imagespec.html#imagespec), a Flyte feature that builds a custom container image without a Dockerfile. Install it using `pip install flytekitplugins-envd`.
+ The examples in this section use [ImageSpec](https://docs.flyte.org/en/latest/user_guide/customizing_dependencies/imagespec.html#imagespec), a Flyte feature that builds a custom container image without a Dockerfile. Install it using `pip install flytekitplugins-envd`.
```
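Returning to the backend toleration configuration quoted above: based on the linked config.go, the `resource-tolerations` map in the k8s plugin configuration might be set like this (a hedged sketch; the key name follows the linked file, the values are illustrative):

```yaml
plugins:
  k8s:
    resource-tolerations:
      nvidia.com/gpu:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```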

adds the matching toleration for that resource (in this case, `gpu`) to the generated PodSpec.
As it follows here, you can configure it to access specific resources using the tolerations for all resources supported by
Kubernetes.
## Requesting a GPU with no preference for device
Contributor commented:

Suggested change:
```diff
- ## Requesting a GPU with no preference for device
+ ## Requesting a GPU with no device preference
```

As it follows here, you can configure it to access specific resources using the tolerations for all resources supported by
Kubernetes.
## Requesting a GPU with no preference for device
The goal in this example is to run the task on a single available GPU :
Contributor commented:

Suggested change:
```diff
- The goal in this example is to run the task on a single available GPU :
+ In this example, we run a task on a single available GPU:
```

```python
def gpu_available() -> bool:
    return torch.cuda.is_available()  # returns True if CUDA (provided by a GPU) is available
```
### How it works?
Contributor commented:

Suggested change:
```diff
- ### How it works?
+ ### How it works
```
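Putting the quoted fragments together, the documented task presumably looks something like this (a hedged reconstruction, not copied from the PR):

```python
from flytekit import task, Resources

@task(requests=Resources(gpu="1"), limits=Resources(gpu="1"))
def gpu_available() -> bool:
    import torch
    return torch.cuda.is_available()  # returns True if CUDA (provided by a GPU) is available
```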


![](https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/gpus/generic_gpu_access.png)

When this task is evaluated, `flyteproller` injects a [toleration](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) in the pod spec:
Contributor commented:

Suggested change:
```diff
- When this task is evaluated, `flyteproller` injects a [toleration](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) in the pod spec:
+ When this task is evaluated, `flytepropeller` injects a [toleration](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) in the pod spec:
```
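The injected toleration presumably takes the standard form (a hedged sketch of the resulting pod spec, not quoted from the docs):

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```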

davidmirror-ops (author) commented:

Oh my, good catch, thanks!

```python
    return torch.cuda.is_available()
```

#### How it works?
Contributor commented:

Suggested change:
```diff
- #### How it works?
+ #### How it works
```



### Request an unpartitioned A100 device
The goal is to run the task using the resources of the entire A100 GPU:
Contributor commented:

Suggested change:
```diff
- The goal is to run the task using the resources of the entire A100 GPU:
+ In the following example, we run a task using the resources of the entire A100 GPU:
```
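For illustration, requesting the entire device from flytekit presumably goes through the accelerators API (a hedged sketch; `A100` is the constant from `flytekit.extras.accelerators`):

```python
from flytekit import task, Resources
from flytekit.extras.accelerators import A100

@task(requests=Resources(gpu="1"), accelerator=A100)  # request a full, unpartitioned A100
def gpu_available() -> bool:
    import torch
    return torch.cuda.is_available()
```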

```python
    return torch.cuda.is_available()
```

#### How it works?
Contributor commented:

Suggested change:
```diff
- #### How it works?
+ #### How it works
```


#### How it works?

When this task is evaluated `flytepropeller` injects a node selector expression that only matches nodes where the label specifying a partition size is **not** present:
Contributor commented:

Suggested change:
```diff
- When this task is evaluated `flytepropeller` injects a node selector expression that only matches nodes where the label specifying a partition size is **not** present:
+ When this task is evaluated, `flytepropeller` injects a node selector expression that only matches nodes where the label specifying a partition size is **not** present:
```
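The injected expression presumably uses the standard `DoesNotExist` operator (a hedged sketch; the label key is configurable in the Flyte backend and is shown here with a GKE-style default):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-gpu-partition-size   # partition-size label; key is configurable
          operator: DoesNotExist
```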



Scheduling can be further controlled by setting in the Helm chart a toleration that `flytepropeller` injects into the task pods:
Contributor commented:

Suggested change:
```diff
- Scheduling can be further controlled by setting in the Helm chart a toleration that `flytepropeller` injects into the task pods:
+ Scheduling can be further controlled by setting a toleration in the Helm chart that `flytepropeller` injects into the task pods:
```

davidmirror-ops (author) commented:

Maybe it's because English is not my native tongue :) but I read the suggestion as if it's the Helm chart that is injected

Contributor commented:

Oh, I see what you mean -- you can disregard my suggestion, then, the original text is much clearer!

davidmirror-ops (author) commented:

Thank you!
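For reference, the Helm-set toleration discussed in this thread might look like this in the flyte-core values file (a hedged sketch; the `gpu-unpartitioned-toleration` key follows the flytek8s plugin config, and the taint key and value are illustrative):

```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-unpartitioned-toleration:
          key: nvidia.com/gpu.accelerator   # illustrative taint key
          operator: Equal
          value: nvidia-tesla-a100          # illustrative taint value
          effect: NoSchedule
```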

Signed-off-by: davidmirror-ops <[email protected]>
davidmirror-ops requested a review from neverett July 25, 2024 10:31
neverett previously approved these changes Jul 25, 2024
davidmirror-ops enabled auto-merge (squash) July 25, 2024 20:29
davidmirror-ops merged commit 07d3cfc into flyteorg:master Jul 25, 2024
49 of 50 checks passed
vlibov pushed a commit to vlibov/flyte that referenced this pull request Aug 16, 2024
* Introduce 3 levels
* Fix ImageSpec config
* Rephrase 1st section and prereqs
* Expand 2nd section up to nodeSelector key
* Add partition scheduling info
* Reorganize instructions
* Improve clarity
* Apply reviews pt1
* Add note on default scheduling behavior
* Add missing YAML and rephrase full A100 behavior

Signed-off-by: davidmirror-ops <[email protected]>
Signed-off-by: Vladyslav Libov <[email protected]>