We observed that distributed Keras Tuner fails in some cases.
The failure is caused by the fix that prevents Keras Tuner from hanging forever: 5-minute RPC timeouts were introduced (see #957).
Now, if the chief worker takes longer than 5 minutes to start, the client gives up and the whole tuning process fails.
RPC server startup is normally quick, but in some cases it can take longer.
The planned solution is to increase the client timeout to 1 hour. We still need a timeout to prevent tuner clients from hanging forever, but it must be high enough that the chief oracle server always has enough time to start.
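The idea of a long but bounded client wait can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual Keras Tuner client code: the helper `wait_for_server`, the constant `CLIENT_TIMEOUT_S`, and the polling interval are hypothetical names chosen for the example, and a plain TCP connection check stands in for the real RPC call.

```python
import socket
import time

# Hypothetical constant: the 1 hour client timeout proposed in this issue.
CLIENT_TIMEOUT_S = 60 * 60


def wait_for_server(host, port, timeout_s=CLIENT_TIMEOUT_S, poll_interval_s=1.0):
    """Block until host:port accepts TCP connections.

    Raises TimeoutError once timeout_s has elapsed, so a client never
    waits forever for a server that does not come up.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"server {host}:{port} did not start within {timeout_s} seconds"
            )
        try:
            # Each attempt is capped so the overall deadline is respected.
            attempt_timeout = min(remaining, poll_interval_s)
            with socket.create_connection((host, port), timeout=attempt_timeout):
                return  # The server is accepting connections.
        except OSError:
            # Server not up yet (e.g. connection refused): wait, then retry.
            pause = min(poll_interval_s, max(0.0, deadline - time.monotonic()))
            time.sleep(pause)
```

With a deadline like this, a slow server start only delays the client, while a server that never starts still produces a clear failure once the hour is up.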
The timeout is set this high to prevent a rare race condition.
Clients need to wait until the chief oracle server starts. This normally takes
a few minutes, but sometimes might take longer.
See keras-team#990 for more details.
Initially there was no timeout at all. It was introduced to avoid tuner jobs
hanging forever if the chief oracle stops responding.
See keras-team#957.