Training a model with multiple GPUs #18

zasr99 · 2024-03-22T11:35:52Z

Hello @ameraner,
After changing n_gpus=2 for model training, when performing prediction tasks, I encountered the issue 'len(layer_names) != len(filtered_layers)'. The layer_names in the weight file .hdf5 only contains 8 layers, which is significantly different from the model structure. Do you know where the problem lies, or do I need to change other parameters besides n_gpus when using multiple GPUs for training?

ameraner · 2024-04-02T09:44:53Z

Hi, I'm sorry, I've never seen this issue.. it could be that the multi-gpu handling, or the file opening, has changed in Keras in the meanwhile? What version of Keras are you using?

zasr99 · 2024-04-02T09:57:17Z

Thank you for your reply! I use tensorflow-gpu=1.15 and Keras=2.3.1. When I use tensorflow-gpu=1.15 and Keras=2.2.4, multi-GPU training cannot be performed. But when Keras=2.3.1 uses multi-GPU training, the problem I raised will appear again.

As shown in the figure above, when tensorflow-gpu=1.15 and Keras=2.2.4, the error message appears when using n_gpu>=2

ameraner · 2024-04-02T10:26:54Z

and from which line of the dsen2-cr code is this originating?

zasr99 · 2024-04-02T10:36:18Z

This is the complete error message when using tensorflow-gpu=1.15 and Keras=2.2.4 for multi-GPU training.
After checking the information, it seems that this is caused by the mismatch between tensorflow-gpu and Keras versions.
So I switched to tensorflow-gpu=1.15 and Keras=2.3.1. With these versions the training process completes normally, but when I use the trained model for prediction, the error I described at the beginning appears.

ameraner · 2024-04-02T10:49:30Z

googling the error I found this issue: tensorflow/tensorflow#30728, maybe you can try downgrading tensorflow and tensorflow-gpu to 1.13.1?

zasr99 · 2024-04-02T11:06:25Z

I actually also tried tensorflow-gpu=1.13.1 and Keras=2.2.4. But the result is the same. I also tried to use the same multiple GPUs as during training for prediction tasks, but another error message appeared.

(The question I raised at the beginning was a problem that occurs when using single GPU prediction)
In general, in my attempts, the model trained with multiple GPUs will report errors whether it is using a single GPU or multiple GPUs for prediction tasks, but the error problems are different.

zasr99 · 2024-04-02T11:18:28Z

From my point of view, the problem lies in the model weights saved after multi-GPU training. The weight file contains only 8 layer_names: ['input_1', 'input_2', 'lambda_18', 'lambda_19', 'lambda_20', 'lambda_21', 'model_1', 'lambda_17'].

Normally there should be layer_names such as Conv2D in the middle.
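
A minimal sketch for checking this directly in the weight file, assuming the .hdf5 follows the usual Keras 2.x layout (a `layer_names` attribute on the root or on a `model_weights` group, and a `weight_names` attribute on each layer group). The function names `summarize_keras_weights` and `find_wrapped_models` are hypothetical helpers, not part of Keras or dsen2-cr, and the "wrapped model" test is only a heuristic. If a single entry such as 'model_1' turns out to own all the Conv2D weights, the file was probably written from the multi-GPU wrapper rather than from the underlying single-GPU model.

```python
def _decode(names):
    # h5py returns attribute strings as bytes; plain str values pass through.
    return [n.decode("utf8") if isinstance(n, bytes) else str(n) for n in names]


def summarize_keras_weights(root):
    """Map each saved layer name to the list of weight names stored under it.

    `root` is an open h5py.File, or any object offering the same
    `.attrs`, `in` and `[]` interface. Assumes the Keras 2.x HDF5 layout.
    """
    # Files written by model.save() keep the weights in a "model_weights" group;
    # files written by model.save_weights() keep them at the root.
    group = root["model_weights"] if "model_weights" in root else root
    summary = {}
    for layer in _decode(group.attrs["layer_names"]):
        summary[layer] = _decode(group[layer].attrs["weight_names"])
    return summary


def find_wrapped_models(summary):
    """Heuristic: return layers whose weights are named after other layers.

    A plain layer 'conv2d_1' stores weights like 'conv2d_1/kernel:0'.
    A nested model such as 'model_1' stores weights named after its inner
    layers instead, so the prefix does not match the layer's own name.
    """
    wrapped = []
    for layer, weights in summary.items():
        owners = {w.split("/")[0] for w in weights}
        if owners and owners != {layer}:
            wrapped.append(layer)
    return wrapped


# Usage (requires h5py; the file name is a placeholder):
#   import h5py
#   with h5py.File("model.hdf5", "r") as f:
#       summary = summarize_keras_weights(f)
#       print(len(summary), "saved layers;",
#             sum(1 for w in summary.values() if w), "of them hold weights")
#       print("possible nested models:", find_wrapped_models(summary))
```

The count of saved layers that hold weights is, as far as I understand the Keras 2.x loader, the number it compares against the weighted layers of the model being loaded, which would explain the 'len(layer_names) != len(filtered_layers)' error when only 'model_1' carries weights.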

ameraner · 2024-04-02T13:21:12Z

I'm sorry, unfortunately I don't know how to further help with this, as I don't have the capacity to debug this anymore. The issue with the layer names is indeed odd, particularly since the model save and load are handled by native keras functions...

zasr99 · 2024-04-02T14:55:44Z

It's okay. I'll look into it. Thank you very much for your patience!
