Training a model with multiple GPUs #18
Comments
Hi, I'm sorry, I've never seen this issue. Could the multi-GPU handling or the file opening have changed in Keras in the meantime? What version of Keras are you using?
And which line of the dsen2-cr code does this originate from?
Googling the error, I found this issue: tensorflow/tensorflow#30728. Maybe you can try downgrading tensorflow and tensorflow-gpu to 1.13.1?
I actually also tried tensorflow-gpu=1.13.1 and Keras=2.2.4, but the result is the same. I also tried running prediction on the same multiple GPUs used during training, but that produced a different error message. (The problem I raised at the beginning occurs when predicting on a single GPU.)
From my point of view, the problem lies in the model weights trained on multiple GPUs. The weight file contains only 8 layer_names: ['input_1', 'input_2', 'lambda_18', 'lambda_19', 'lambda_20', 'lambda_21', 'model_1', 'lambda_17']. Normally there should be layer_names such as Conv2D layers in the middle.
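For anyone who wants to reproduce the check above, here is a minimal sketch of how the layer names could be inspected. The helper names `read_layer_names` and `looks_like_multi_gpu_wrapper` are hypothetical, and the heuristic rests on an assumption not confirmed in this thread: that the weights were saved from the wrapper returned by Keras' `multi_gpu_model`, which would store the original network as a single nested `model_N` entry next to `input_*` and `lambda_*` layers.

```python
def looks_like_multi_gpu_wrapper(layer_names):
    """Heuristic (assumption): True if the names resemble a multi-GPU wrapper.

    A wrapper-style file has at least one nested "model_*" entry and nothing
    else besides "input_*" and "lambda_*" layers at the top level.
    """
    # h5py may return the names as bytes, so decode them first.
    names = [n.decode("utf8") if isinstance(n, bytes) else n for n in layer_names]
    nested = [n for n in names if n.startswith("model_")]
    others = [n for n in names
              if not n.startswith(("input_", "lambda_", "model_"))]
    return bool(nested) and not others


def read_layer_names(weights_path):
    """Read the top-level `layer_names` attribute of a Keras HDF5 file."""
    import h5py  # third-party; imported lazily so the heuristic has no dependency

    with h5py.File(weights_path, "r") as f:
        # Files written by model.save() keep the weights under "model_weights";
        # files written by save_weights() keep them at the root.
        group = f["model_weights"] if "model_weights" in f else f
        return list(group.attrs["layer_names"])
```

If `looks_like_multi_gpu_wrapper(read_layer_names(path))` returns `True` for the checkpoint in question, the Conv2D weights are probably stored inside the nested `model_1` group rather than at the top level, which would explain the length mismatch when loading into the single-GPU model.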
I'm sorry, unfortunately I don't know how to help further with this, as I no longer have the capacity to debug it. The issue with the layer names is indeed odd, particularly since model saving and loading are handled by native Keras functions...
It's okay. I'll look into it. Thank you very much for your patience!
Hello @ameraner,
After setting n_gpus=2 for model training, I encountered the error 'len(layer_names) != len(filtered_layers)' when running prediction. The layer_names attribute in the .hdf5 weight file contains only 8 entries, which differs significantly from the model structure. Do you know where the problem lies, or do I need to change other parameters besides n_gpus when training on multiple GPUs?
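One workaround that might apply here, sketched under the assumption that training uses Keras' `multi_gpu_model` wrapper (this is not confirmed for dsen2-cr in the thread): save the weights of the nested single-GPU model rather than those of the wrapper. The helpers `find_nested_model` and `save_template_weights` below are hypothetical names and are not part of Keras or dsen2-cr. They rely only on duck typing: a nested model is a layer that itself exposes a `layers` attribute.

```python
def find_nested_model(parallel_model):
    """Return the first layer that itself has layers (a nested model), or None."""
    for layer in parallel_model.layers:
        if hasattr(layer, "layers"):
            return layer
    return None


def save_template_weights(parallel_model, path):
    """Save only the nested single-GPU model's weights to `path`.

    Returns the name of the nested model so the caller can log which
    layer was treated as the template.
    """
    template = find_nested_model(parallel_model)
    if template is None:
        raise ValueError("no nested model found; is this a multi-GPU wrapper?")
    template.save_weights(path)
    return template.name
```

Weights saved this way should carry the layer names of the original network, so they can be loaded into a model built with a single GPU. Note that this only helps for new checkpoints; a checkpoint already written from the wrapper would have to be loaded into the wrapper first and then re-saved through the nested model.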