-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Webhook Assigning Identical TPU_WORKER_ID
s
#859
base: main
Are you sure you want to change the base?
Fix Webhook Assigning Identical TPU_WORKER_ID
s
#859
Conversation
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
TPU_WORKER_ID
sTPU_WORKER_ID
s
Signed-off-by: Ryan O'Leary <[email protected]>
I removed a troubleshooting section with the following:
I can add this section back in as well as the PodInformer logic if we want to keep using the cache rather than querying the API server directly to ensure consistency. |
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@@ -708,6 +752,19 @@ func init() { | |||
klog.InitFlags(nil) | |||
} | |||
|
|||
// addPod allows next goroutine to start once the webhook PodInformer cache updates | |||
func (t *TPUWebhookServer) addPod(obj interface{}) { | |||
// It's not guaranteed the webhook replica that admitted the Pod for this event is the same as the current caller (i.e. wg could be 0). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're not actually waiting for the pod event for the mutated pod, this is effectively adding some arbitrary delay to pod admission, which may be fixing the race condition with TPU_WORKER_ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AddFunc
is only triggered when a Pod with labels "ray.io/node-type=worker,app.kubernetes.io/created-by=kuberay-operator"
is added. I can change it to check that it's a TPU Pod with the injected env vars before calling wg.Done()
, but I wanted to err towards releasing the Wait versus continuously blocking. From my manual testing Timed out waiting for PodInformer AddFunc
was never showing up in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added logic to check that the Pod in addPod
is a TPU worker Pod before unblocking the next Mutate
call:
04bf73c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewsykim I went ahead in 047ff5f and changed it to check that the Pod admitted to the cache is the last TPU worker Pod mutated by that webhook replica before unblocking the next goroutine. We check this by adding a lastAdmitted
var to each TPUWebhookServer
and setting it to <replicaIndex>-<TPU_WORKER_ID>
which are vars set for both single-host and multi-host TPUs. If the webhook Pod restarts in between Pod admission requests, the value of lastAdmitted
will be empty and the addPod
function will be a no-op (i.e. it won't wait for anything). The PodInformer will be initialized again and obtain an up-to-date list of Pods from the API server, so the next Mutate
call should proceed correctly. Otherwise, each webhook Mutate
request will wait for the PodInformer cache to update from the previous request before proceeding which should ensure unique TPU_WORKER_ID
s even when large slice sizes result in PodInformer updates that are slower than the latency between mutating admission requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From offline discussion, with f87aec0 we now check for <namespace>-<RayCluster name>-<replicaIndex>-<TPU_WORKER_ID>
which should catch all the cases.
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>
return nil, err | ||
} | ||
// set the unique identifier for the last admitted Pod by this TPUWebhookServer | ||
t.lastAdmitted = fmt.Sprintf("%s-%s-%d-%d", namespace, clusterName, replicaIndex, tpuWorkerID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should last admitted be tracked per RayCluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Pods are admitted in series by each webhook replica I think it works the same to just check that the last admitted Pod has been added to the cache, regardless of RayCluster. If we did it per-RayCluster we'd also have to handle RayCluster deletion to clean up the list of lastAdmitted
Pods.
Signed-off-by: ryanaoleary <[email protected]>
This PR adds a
sync.WaitGroup
object and integerwaiting
var to eachTPUWebhookServer
object. All goroutine calls tomutatePod
nowWait()
before listing from the PodInformer cache, or timeout after 1 second, and then increment the waiting var and callwg.Add(1)
.wg.Done()
is called by theAddFunc
EventHandler
, which indicates the PodInformer has updated with the last Pod admitted by that webhook replica. This ensures the PodInformer cache is available and updating prior to listing from the cache. To support multiple webhook replicas,wg.Done()
is only called if the intwaiting
var on theTPUWebhookServer
object is greater than 1. This allows the webhook to block on previousTPUWebhookServer.Mutate
calls until the PodInformer cache updates at least once. I also added error checking for identicalTPU_WORKER_ID
s being assigned within the same slice (as opposed to just letting the Jax initialization time out).Testing:
Related Issue #: 858