
ModelMesh behaviour during Node Drain (i.e. during Cluster Upgrade/Node Update) #142

Open
SDJustus opened this issue Jun 21, 2024 · 0 comments


Hey guys,

First and foremost, awesome project you have created. :)

I have a question about ModelMesh behaviour during node drains in Kubernetes.

My Setup:

  • I am running ModelMesh as a sidecar of a modelmesh-serving deployment on EKS, with PodDisruptionBudgets configured for the serving runtimes so that only one replica (of 4) is shut down at a time during node drains.
  • I use ModelMesh to serve GPU-backed models on Triton that take approx. 25 s to load on a new server due to a warmup configuration.
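For reference, a sketch of the kind of PodDisruptionBudget described above (at most one of the 4 serving-runtime replicas disrupted at a time); the name and selector labels here are placeholders, not my actual values:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: modelmesh-serving-pdb      # placeholder name
spec:
  maxUnavailable: 1                # only one of the 4 replicas may be down at a time
  selector:
    matchLabels:
      app: modelmesh-serving       # placeholder label matching the serving runtime pods
```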

My node drain observations:

  • on SIGTERM, ModelMesh runs a preStop hook for the pod, which includes waiting until all models deployed on that pod are loaded elsewhere
  • however, during that time (after preStop, before all models are loaded elsewhere), ModelMesh doesn't accept new inference requests for those models, resulting in a ModelNotHereException when another ModelMesh instance tries to fulfil the inference request
  • this delays the inference request until the models are loaded on another instance, which in my case can take up to 25 s
  • I can see that while the ModelMesh instance that is shutting down tries to load the models somewhere else, the models are still loaded on its Triton
  • this should enable ModelMesh to handle inference requests even after preStop
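As a client-side workaround for the delay described above (not part of ModelMesh itself), one could retry inference with backoff until the model finishes loading on another instance; `infer` and the exception type here are hypothetical stand-ins for the actual gRPC call and its "model not ready" error:

```python
import time


def infer_with_retry(infer, request, max_wait_s=30.0, base_delay_s=0.5):
    """Retry a hypothetical infer() call with exponential backoff.

    Intended to ride out the window during a node drain where a model
    is being reloaded on another ModelMesh instance (~25 s in the
    setup above). RuntimeError stands in for the transient error the
    real client would raise.
    """
    deadline = time.monotonic() + max_wait_s
    delay = base_delay_s
    while True:
        try:
            return infer(request)
        except RuntimeError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise  # give up once the overall budget is exhausted
            time.sleep(min(delay, remaining))
            delay = min(delay * 2, 5.0)  # cap the backoff interval
```

This only masks the delay for callers; it does not change the ModelNotHereException behaviour on the server side.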

Example request during node drain:

  • 08:08:41.096: model-mesh-pre-stop (due to node drain)
  • 08:08:57.272: another ModelMesh instance receives the inference request and gets a “ModelNotHereException” from the ModelMesh instance in preStop
  • 08:09:06.251: the model for the inference request is loaded again elsewhere (the old ModelMesh pod triggers a complete shutdown of Triton and itself)
  • 08:09:06.417: the inference response is sent

My Question:

  • is it a misconfiguration on my side that ModelMesh does not accept inference requests after preStop, or is this intended behaviour?

P.S.: Sorry if this issue should be raised in the modelmesh-serving repo. If that is the case, I will reopen it there.
