Fix Webhook Assigning Identical TPU_WORKER_IDs #859

Open
wants to merge 10 commits into base: main

Conversation

@ryanaoleary (Collaborator) commented Oct 25, 2024

This PR adds a `sync.WaitGroup` and an integer `waiting` var to each `TPUWebhookServer` object. All goroutine calls to `mutatePod` now `Wait()` before listing from the PodInformer cache (or time out after 1 second), and then increment the `waiting` var and call `wg.Add(1)`. `wg.Done()` is called by the `AddFunc` EventHandler, which indicates the PodInformer has been updated with the last Pod admitted by that webhook replica. This ensures the PodInformer cache is available and up to date before it is listed from. To support multiple webhook replicas, `wg.Done()` is only called if the `waiting` var on the `TPUWebhookServer` object is greater than 1. This allows the webhook to block on previous `TPUWebhookServer.Mutate` calls until the PodInformer cache has updated at least once. I also added error checking for identical `TPU_WORKER_ID`s being assigned within the same slice (as opposed to just letting the JAX initialization time out).

Testing:

  • Unit Tests
  • Manual Tests: tested with a v6e-256 Ray TPU worker group using 1 and 3 webhook replicas

Related Issue: #858

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary self-assigned this Oct 25, 2024
@ryanaoleary ryanaoleary marked this pull request as draft October 25, 2024 08:16
@ryanaoleary ryanaoleary marked this pull request as ready for review October 26, 2024 03:42
@ryanaoleary ryanaoleary changed the title [WIP] Fix Webhook Assigning Identical TPU_WORKER_IDs Fix Webhook Assigning Identical TPU_WORKER_IDs Oct 26, 2024
@ryanaoleary (Collaborator Author) commented:

I removed a troubleshooting section with the following:

## `TPU_WORKER_ID` assigned to multiple TPU workers in slice

### Symptoms
The webhook outputs the error message `Identical TPU_WORKER_ID assigned to multiple TPU workers in slice`.

### Solution #1
The Ray TPU webhook relies on a PodInformer cache to retrieve the current state of TPU worker Pods in a RayCluster and assign `TPU_WORKER_ID`s. This informer is synced prior to each Pod mutation. However, when quickly deleting and re-creating RayClusters (especially for larger worker groups), it's possible for the PodLister to retrieve stale information and incorrectly assign `TPU_WORKER_ID`s. This issue is more likely to occur with a large number of worker nodes. The easiest solution in this case is to delete the Ray custom resource and create it again with `kubectl apply`.

I can add this section back in as well as the PodInformer logic if we want to keep using the cache rather than querying the API server directly to ensure consistency.

ray-on-gke/tpu/kuberay-tpu-webhook/main.go (outdated)
@@ -708,6 +752,19 @@ func init() {
klog.InitFlags(nil)
}

// addPod allows next goroutine to start once the webhook PodInformer cache updates
func (t *TPUWebhookServer) addPod(obj interface{}) {
// It's not guaranteed the webhook replica that admitted the Pod for this event is the same as the current caller (i.e. wg could be 0).

Collaborator:

Since we're not actually waiting for the Pod event for the mutated Pod, this effectively adds some arbitrary delay to Pod admission, which may be what's fixing the race condition with `TPU_WORKER_ID`.

Collaborator Author:

AddFunc is only triggered when a Pod with the labels `ray.io/node-type=worker,app.kubernetes.io/created-by=kuberay-operator` is added. I can change it to check that it's a TPU Pod with the injected env vars before calling `wg.Done()`, but I wanted to err towards releasing the Wait rather than blocking indefinitely. In my manual testing, `Timed out waiting for PodInformer AddFunc` never showed up in the logs.

Collaborator Author:

Added logic in 04bf73c to check that the Pod in `addPod` is a TPU worker Pod before unblocking the next `Mutate` call.

Collaborator Author:

@andrewsykim I went ahead in 047ff5f and changed it to check that the Pod admitted to the cache is the last TPU worker Pod mutated by that webhook replica before unblocking the next goroutine. We check this by adding a `lastAdmitted` var to each `TPUWebhookServer` and setting it to `<replicaIndex>-<TPU_WORKER_ID>`, both of which are set for single-host and multi-host TPUs. If the webhook Pod restarts between Pod admission requests, `lastAdmitted` will be empty and the `addPod` function will be a no-op (i.e. it won't wait for anything); the PodInformer will be initialized again and obtain an up-to-date list of Pods from the API server, so the next `Mutate` call should proceed correctly. Otherwise, each webhook `Mutate` request will wait for the PodInformer cache to update from the previous request before proceeding, which should ensure unique `TPU_WORKER_ID`s even when large slice sizes make PodInformer updates slower than the latency between mutating admission requests.

Collaborator Author:

From offline discussion: with f87aec0 we now check for `<namespace>-<RayCluster name>-<replicaIndex>-<TPU_WORKER_ID>`, which should catch all the cases.

ray-on-gke/tpu/kuberay-tpu-webhook/main.go (outdated)
return nil, err
}
// set the unique identifier for the last admitted Pod by this TPUWebhookServer
t.lastAdmitted = fmt.Sprintf("%s-%s-%d-%d", namespace, clusterName, replicaIndex, tpuWorkerID)

Collaborator:

Should `lastAdmitted` be tracked per RayCluster?

Collaborator Author:

Since the Pods are admitted in series by each webhook replica, I think it works the same to just check that the last admitted Pod has been added to the cache, regardless of RayCluster. If we tracked it per RayCluster, we'd also have to handle RayCluster deletion to clean up the list of `lastAdmitted` Pods.

Signed-off-by: ryanaoleary <[email protected]>
Successfully merging this pull request may close these issues:

[Ray TPU Webhook] Pod Informer inconsistency for large RayCluster sizes