gRPC error occurring in worker tuners stalls chief with possible trial loss #915
Comments
According to the error stack trace, I assumed that the error applied to all tuners. However, just to check, I repeated the tests with the random and Hyperband tuners. Try as I might, I cannot get the random tuner to fail; maybe I am not trying hard enough. I was, however, able to get the Hyperband tuner to fail. Unfortunately, this seems to be a different error. Here is the trace:
Traceback (most recent call last):
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 270, in _try_run_and_update_trial
self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 235, in _run_and_update_trial
results = self.run_trial(trial, *fit_args, **fit_kwargs)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/tuners/hyperband.py", line 425, in run_trial
return super().run_trial(trial, *fit_args, **fit_kwargs)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 287, in run_trial
obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 213, in _build_and_fit_model
model = self._try_build(hp)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/tuner.py", line 155, in _try_build
model = self._build_hypermodel(hp)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/tuners/hyperband.py", line 432, in _build_hypermodel
model.load_weights(self._get_checkpoint_fname(trial_id))
File "/home/vscode/.local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/vscode/.local/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ../results/keras_tuner_ex1/project_keras_tuner_ex1/trial_0005/checkpoint
Traceback (most recent call last):
File "/workspaces/Unsupervised-Anomaly-Detection-with-SSIM-AE/KerasTunerEx2.py", line 129, in <module>
tuner.search(
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 231, in search
self.on_trial_end(trial)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 335, in on_trial_end
self.oracle.end_trial(trial)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/distribute/oracle_client.py", line 90, in end_trial
self.stub.EndTrial(
File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 1030, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/vscode/.local/lib/python3.10/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application: can only concatenate str (not "NoneType") to str"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Exception calling application: can only concatenate str (not \"NoneType\") to str", grpc_status:2, created_time:"2023-07-13T09:17:41.3629395+00:00"}"
>
I am wondering if the cause may not be the same. I used the same code but added the Hyperband tuner. The error, however, seems to manifest itself only when I first launch the slaves and then the master. Also interesting is that the worker actually starts work and reports something (see the output below):
Search: Running Trial #14
Value |Best Value So Far |Hyperparameter
64 |40 |units
4 |2 |tuner/epochs
0008 |None |tuner/trial_id
2 |2 |tuner/bracket
1 |0 |tuner/round
2 |0 |tuner/initial_epoch
Traceback (most recent call last):
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 270, in _try_run_and_update_trial
self._run_and_update_trial(trial, *fit_args, **fit_kwargs)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/engine/base_tuner.py", line 235, in _run_and_update_trial
results = self.run_trial(trial, *fit_args, **fit_kwargs)
File "/home/vscode/.local/lib/python3.10/site-packages/keras_tuner/tuners/hyperband.py", line 425, in run_trial
return super().run_trial(trial, *fit_args, **fit_kwargs)
File ...
I would have expected the worker to poll the chief first and only then start its work. Restarting the workers always produces the same result. As in the case above, once a worker fails, the chief stalls and needs to be killed. To get this to work again, I first launch the workers and then the master. EDIT: this error also occurs with the Bayesian tuner 8-(
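In case it helps to see what the worker side observes: the chief-side exception only reaches the worker as a generic UNKNOWN status, so the most I can do locally is catch the RPC error around search(). This is a hypothetical workaround sketch, not a KerasTuner facility; tuner and the data variables stand in for my example code:

```python
import grpc

try:
    tuner.search(x_train, y_train, epochs=2, validation_data=(x_val, y_val))
except grpc.RpcError as err:
    # _InactiveRpcError is a grpc.RpcError; code() and details() only carry the
    # generic "Exception calling application: ..." message forwarded by the chief.
    print("RPC to the oracle failed:", err.code(), err.details())
```

This at least lets the worker exit cleanly instead of crashing mid-trace, but it does nothing about the stalled chief.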
Describe the bug
While trying to refactor and correct a bug of mine, I started getting errors I had not seen before. I made an initial report here, but (as expected) the problem lies elsewhere. I have created a small example that reproduces the same exception with high probability. The problem may be due to a race condition, which I have been able to trigger in nearly every experiment. Here is the error I get in one or more workers:
Usually, when all workers get several trials to perform, the chief shows the message:
and then terminates gracefully. In this case, no error occurs. Another thing I have noticed is that, usually when the error occurs, the chief generates a ConvergenceWarning, as shown below.
When the exception above does occur, the chief still prints the message that it will exit in 40 seconds. However, it then stalls indefinitely, and I need to kill it with Ctrl-C or the kill command.
The probability that this error occurs increases with the number of trials, so the code I provide runs 20 trials using a single hyperparameter with about 10 values. I limit the number of epochs of each trial to make testing faster. The exception is not guaranteed to occur in every experiment, but it happens very often; in my original experiments it occurs every time.
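For reference, the shape of the search is roughly the following. This is only a hypothetical reduction of the attached example: the model, hyperparameter name, and data are placeholders; only the directory and project name match the paths in the traces above.

```python
import keras
from keras import layers
import keras_tuner as kt

def build_model(hp):
    # A single hyperparameter with roughly 10 possible values.
    units = hp.Int("units", min_value=8, max_value=80, step=8)
    model = keras.Sequential([
        layers.Dense(units, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

tuner = kt.RandomSearch(            # the same setup is used with Hyperband / BayesianOptimization
    build_model,
    objective="val_loss",
    max_trials=20,                  # 20 trials, few epochs each, to keep runs short
    directory="../results/keras_tuner_ex1",
    project_name="project_keras_tuner_ex1",
)
# tuner.search(x_train, y_train, epochs=2, validation_data=(x_val, y_val))
```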
To Reproduce
I first launch 3 workers as follows:
cd scripts
./tune_slave_ex1.sh 1
./tune_slave_ex1.sh 2
./tune_slave_ex1.sh 3
I then launch the chief:
Usually within 2 attempts, I get the exception. YMMV.
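For context, the worker/chief roles in KerasTuner's distributed mode are selected via environment variables, so the launch scripts above set something along these lines (the address, port, and IDs here are placeholders, not the values from the attached scripts):

```python
# Hypothetical stand-in for what tune_slave_ex1.sh and the chief launcher configure.
import os

os.environ["KERASTUNER_ORACLE_IP"] = "127.0.0.1"   # address where the chief's oracle listens
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"
os.environ["KERASTUNER_TUNER_ID"] = "chief"        # or "tuner1", "tuner2", ... on the workers

# With KERASTUNER_TUNER_ID == "chief", tuner.search() runs the oracle service and blocks;
# with any other ID, the same script acts as a worker and requests trials from the chief
# over gRPC.
```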
keras_tuner_ex1_1.zip
I did the experiments in a VM with Linux (Ubuntu).
I use:
I use VSCode to set up the container and have the following Python packages installed:
Expected behavior
I expect the chief to terminate and no exceptions to occur in the workers.
Additional context
Besides correcting this issue, I would like to know if there is some workaround I can use in the meantime. I have no reliable way to tell when a worker fails other than noticing a drop in CPU usage. The goal is to run the experiments in as short a time as possible and, as soon as they terminate, to run some additional evaluation code on the chief. Because of this issue, I can do neither automatically. I am also unsure whether trials are lost.
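The only stop-gap automation I can think of is a watchdog around the chief process that kills it once all trial directories exist but no further progress is made. A rough sketch, with the script name, results directory, trial count, and timeout all placeholders for my setup (this is not a KerasTuner feature):

```python
import subprocess
import time
from pathlib import Path

EXPECTED_TRIALS = 20
PROJECT_DIR = Path("../results/keras_tuner_ex1/project_keras_tuner_ex1")
STALL_TIMEOUT = 120  # seconds without progress before assuming the chief is stalled

# Launch the chief (placeholder script name).
chief = subprocess.Popen(["python", "run_chief.py"])
last_count, last_change = 0, time.time()

while chief.poll() is None:
    time.sleep(10)
    count = len(list(PROJECT_DIR.glob("trial_*")))
    if count != last_count:
        last_count, last_change = count, time.time()
    # All trials created and nothing has changed for a while: give up and kill the chief.
    if count >= EXPECTED_TRIALS and time.time() - last_change > STALL_TIMEOUT:
        chief.kill()
        break

# ... run the follow-up evaluation code here ...
```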
Would you like to help us fix it?
Sure. Might need some guidance.