Fix Webhook Assigning Identical TPU_WORKER_IDs #859

Open
wants to merge 10 commits into base: main

Conversation

@ryanaoleary (Collaborator) commented Oct 25, 2024

This PR adds a `sync.WaitGroup` and an integer `waiting` var to each `TPUWebhookServer` object. All goroutine calls to `mutatePod` now `Wait()` before listing from the PodInformer cache (or time out after 1 second), and then increment the `waiting` var and call `wg.Add(1)`. `wg.Done()` is called by the `AddFunc` EventHandler, which indicates the PodInformer has been updated with the last Pod admitted by that webhook replica. This ensures the PodInformer cache is available and up to date before it is listed from. To support multiple webhook replicas, `wg.Done()` is only called if the `waiting` var on the `TPUWebhookServer` object is greater than 1. This allows the webhook to block on previous `TPUWebhookServer.Mutate` calls until the PodInformer cache has updated at least once. I also added error checking for identical `TPU_WORKER_ID`s being assigned within the same slice (as opposed to just letting the JAX initialization time out).

Testing:

  • Unit Tests
  • Manual Tests: tested with a v6e-256 Ray TPU worker group using 1 and 3 webhook replicas

Related Issue: #858

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary self-assigned this Oct 25, 2024
@ryanaoleary ryanaoleary marked this pull request as draft October 25, 2024 08:16
@ryanaoleary ryanaoleary marked this pull request as ready for review October 26, 2024 03:42
@ryanaoleary ryanaoleary changed the title [WIP] Fix Webhook Assigning Identical TPU_WORKER_IDs Fix Webhook Assigning Identical TPU_WORKER_IDs Oct 26, 2024
@ryanaoleary (Collaborator Author) commented:

I removed a troubleshooting section with the following:

## `TPU_WORKER_ID` assigned to multiple TPU workers in slice

### Symptoms
The webhook outputs the error message `Identical TPU_WORKER_ID assigned to multiple TPU workers in slice`.

### Solution #1
The Ray TPU webhook relies on a PodInformer cache to retrieve the current state of TPU worker Pods in a RayCluster and assign `TPU_WORKER_ID`s. This informer is synced prior to each Pod mutation. However, when quickly deleting and re-creating RayClusters (especially for larger worker groups), it's possible for the PodLister to retrieve stale information and incorrectly assign `TPU_WORKER_ID`s. This issue is more likely to occur with a large number of worker nodes. The easiest solution in this case is to delete the Ray custom resource and create it again with `kubectl apply`.

I can add this section back in as well as the PodInformer logic if we want to keep using the cache rather than querying the API server directly to ensure consistency.

ray-on-gke/tpu/kuberay-tpu-webhook/main.go (outdated)
@@ -708,6 +752,19 @@ func init() {
klog.InitFlags(nil)
}

// addPod allows next goroutine to start once the webhook PodInformer cache updates
func (t *TPUWebhookServer) addPod(obj interface{}) {
// It's not guaranteed the webhook replica that admitted the Pod for this event is the same as the current caller (i.e. wg could be 0).

Collaborator:

Since we're not actually waiting for the Pod event for the mutated Pod, this effectively adds some arbitrary delay to Pod admission, which may be what's fixing the race condition with `TPU_WORKER_ID`.

Collaborator Author:

AddFunc is only triggered when a Pod with the labels `ray.io/node-type=worker,app.kubernetes.io/created-by=kuberay-operator` is added. I can change it to check that it's a TPU Pod with the injected env vars before calling `wg.Done()`, but I wanted to err towards releasing the Wait rather than blocking indefinitely. In my manual testing, `Timed out waiting for PodInformer AddFunc` never showed up in the logs.

Collaborator Author:

Added logic in 04bf73c to check that the Pod in `addPod` is a TPU worker Pod before unblocking the next `Mutate` call.

Collaborator Author:

@andrewsykim I went ahead in 047ff5f and changed it to check that the Pod admitted to the cache is the last TPU worker Pod mutated by that webhook replica before unblocking the next goroutine. We check this by adding a `lastAdmitted` var to each `TPUWebhookServer` and setting it to `<replicaIndex>-<TPU_WORKER_ID>`, both of which are set for single-host and multi-host TPUs. If the webhook Pod restarts between Pod admission requests, `lastAdmitted` will be empty and the `addPod` function will be a no-op (i.e. it won't wait for anything); the PodInformer will be initialized again and obtain an up-to-date list of Pods from the API server, so the next `Mutate` call should proceed correctly. Otherwise, each webhook `Mutate` request will wait for the PodInformer cache to update from the previous request before proceeding, which should ensure unique `TPU_WORKER_ID`s even when large slice sizes make PodInformer updates slower than the latency between mutating admission requests.

Collaborator Author:

From offline discussion: with f87aec0 we now check for `<namespace>-<RayCluster name>-<replicaIndex>-<TPU_WORKER_ID>`, which should catch all the cases.

ray-on-gke/tpu/kuberay-tpu-webhook/main.go (outdated)
return nil, err
}
// set the unique identifier for the last admitted Pod by this TPUWebhookServer
t.lastAdmitted = fmt.Sprintf("%s-%s-%d-%d", namespace, clusterName, replicaIndex, tpuWorkerID)

Collaborator:

Should `lastAdmitted` be tracked per RayCluster?

Collaborator Author:

Since the Pods are admitted in series by each webhook replica, I think it works the same to just check that the last admitted Pod has been added to the cache, regardless of RayCluster. If we tracked it per RayCluster, we'd also have to handle RayCluster deletion to clean up the list of `lastAdmitted` Pods.

Signed-off-by: ryanaoleary <[email protected]>
Successfully merging this pull request may close these issues:

[Ray TPU Webhook] Pod Informer inconsistency for large RayCluster sizes