I am attempting to train a model with 3 billion parameters on two A100 GPUs using nvidia-tensorflow 1.15 (21.07-tf1-py3), with a batch size of 24 and `tf.distribute.MirroredStrategy`. The error message is:
This issue seems to occur only when the model is large enough and distributed training is used: the same model trains successfully on a single GPU with a batch size of 12, and a 1.5B-parameter model trains successfully on two GPUs.
I understand that TensorFlow 1.x may not be the best choice for training large models, but at present I need to address this issue.