Describe the bug
From time to time, our system spins out of control, throwing many ModelNotHereExceptions which eventually lead to "8 retry iterations exhausted for model".
Our registration process is completely automated and is triggered by a registerModel gRPC request (instead of a YAML configuration), followed by an ensureLoaded request to validate that the registration completed successfully.
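In code, the flow looks roughly like the sketch below (a minimal sketch only; the generated stub module, stub class, endpoint, and request field names are illustrative placeholders, not necessarily the exact definitions from model-mesh.proto):

```python
# Minimal sketch of the automated registration flow described above.
# NOTE: "mmesh_pb2*", the stub class, the endpoint and the field names are
# illustrative placeholders, not the exact messages from model-mesh.proto.
import grpc
import mmesh_pb2, mmesh_pb2_grpc  # hypothetical stubs generated from model-mesh.proto

channel = grpc.insecure_channel("modelmesh-serving.example:8033")
stub = mmesh_pb2_grpc.ModelMeshStub(channel)

model_id = "4774912c"

# 1. Register the model via gRPC (no YAML / InferenceService configuration involved).
stub.registerModel(mmesh_pb2.RegisterModelRequest(modelId=model_id))

# 2. Ask for the model to be loaded and treat a successful response as
#    confirmation that the registration completed.
status = stub.ensureLoaded(mmesh_pb2.EnsureLoadedRequest(modelId=model_id))
print(status)
```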
Models:
The issue is not consistent per model: a failing invocation of a model can succeed on the next try, if that retry happens to be directed to a non-faulty mm pod (see the next section).
MM pods:
We have a few dozen mm pods, and the issue is very prominent in only some of them (<50%), which we refer to as "faulty" pods. Faulty pods are still functioning, meaning they are able to serve, run predictions and invoke internal requests, but they have a very high error rate due to the ModelNotHereExceptions.
It looks like the faulty pods are somehow out of sync with ETCD and send their internal requests to seemingly random pods.
None of the mm pods are new; they had been running for hours or days before the issue started.
Note that non-faulty pods also throw these errors from time to time.
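One way to tell the faulty pods apart is to count ModelNotHereException occurrences per pod; a rough triage sketch (our own helper, not part of ModelMesh; the logs/ directory layout and container name are assumptions):

```python
# Count ModelNotHereException occurrences per mm pod to spot "faulty" pods.
# Assumes logs were collected beforehand, one file per pod, e.g.:
#   kubectl logs <pod-name> -c mm > logs/<pod-name>.log
from collections import Counter
from pathlib import Path

counts = Counter()
for log_file in Path("logs").glob("*.log"):
    pod = log_file.stem
    with log_file.open(errors="replace") as f:
        counts[pod] = sum("ModelNotHereException" in line for line in f)

# Pods at the top of this list are the ones referred to as "faulty".
for pod, n in counts.most_common():
    print(f"{n:8d}  {pod}")
```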
ETCD:
We do, however, suspect ETCD, since its pods were restarted (for reasons that are still unclear to us) and the only faulty pods are ones that were created prior to the ETCD restart.
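One way to test the "out of sync" theory is to dump the model registry straight from ETCD and diff it against the placements that a faulty pod logs for its internal requests. A minimal sketch, assuming the python-etcd3 client and that the configured model-mesh root prefix is mm/ (the endpoint and prefix below may differ in your deployment):

```python
# Dump the model-registry keys from etcd so they can be diffed against what a
# "faulty" pod believes. The endpoint and the "mm/" prefix are assumptions
# based on our deployment's configuration, not ModelMesh defaults.
import etcd3

client = etcd3.client(host="etcd.example.svc", port=2379)

for value, meta in client.get_prefix("mm/"):
    print(meta.key.decode(), "->", value.decode(errors="replace"))
```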
Mitigation:
The issue usually stops when there is a scale-in event, i.e. when some of the pods are terminated.
Note that the faulty pod itself might not be the one terminated; the errors can stop because a different pod was terminated (perhaps one that the problematic model was loaded on).
Example:
In the attached log file, you can see that a newly registered model 4774912c is facing this issue, even though it was loaded on modelmesh-serving-triton-2.x-768448c4fb-q9564.
The external requests to the many faulty pods are directed to 8 pods, none of which is modelmesh-serving-triton-2.x-768448c4fb-q9564.
report.csv
As you can see, the situation is very peculiar and we are not sure how to investigate further.
We are curious:
Thanks!