After the model was loaded, it hung for a long time (about 30 minutes, I think, since PyTorch's default timeout is 30 minutes).
Then this error was raised:
RuntimeError: Timed out initializing process group in store based barrier on rank 2 # same error on all ranks
Any solutions?
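For readers unfamiliar with the message above, here is a minimal sketch of how a store-based barrier behaves. It is a simplified, single-machine model and not PyTorch's actual implementation: `InMemoryStore`, `store_based_barrier`, and `run_ranks` are hypothetical names invented for this illustration, and threads stand in for ranks. Each rank increments a shared counter and then polls until the counter reaches the world size; if that does not happen before the timeout, the waiting rank raises the error.

```python
import threading
import time


class InMemoryStore:
    """Toy stand-in for a shared key-value store (hypothetical, not a PyTorch class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def add(self, key, amount):
        # Atomically add `amount` to the counter and return the new value.
        with self._lock:
            self._counters[key] = self._counters.get(key, 0) + amount
            return self._counters[key]


def store_based_barrier(store, rank, world_size, timeout_s, poll_s=0.01):
    """Announce arrival, then poll until all ranks have arrived or time runs out."""
    key = "barrier"
    store.add(key, 1)  # this rank has arrived
    deadline = time.monotonic() + timeout_s
    while store.add(key, 0) < world_size:  # add(key, 0) just reads the counter
        if time.monotonic() > deadline:
            raise RuntimeError(
                "Timed out initializing process group in store based barrier "
                f"on rank {rank}"
            )
        time.sleep(poll_s)


def run_ranks(world_size, delays, timeout_s):
    """Run one thread per rank listed in `delays`.

    delays[rank] is the time (seconds) the rank spends before reaching the
    barrier. Ranks missing from `delays` never reach the barrier at all.
    """
    store = InMemoryStore()
    results = {}

    def worker(rank):
        time.sleep(delays[rank])
        try:
            store_based_barrier(store, rank, world_size, timeout_s)
            results[rank] = "ok"
        except RuntimeError as exc:
            results[rank] = str(exc)

    threads = [threading.Thread(target=worker, args=(r,)) for r in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # All four ranks arrive well inside the timeout: every rank passes.
    print(run_ranks(4, {0: 0.0, 1: 0.05, 2: 0.1, 3: 0.15}, timeout_s=2.0))
    # Rank 3 never arrives: every waiting rank times out.
    print(run_ranks(4, {0: 0.0, 1: 0.0, 2: 0.0}, timeout_s=0.2))
```

In real PyTorch code the wait can be lengthened by passing a `datetime.timedelta` as the `timeout` argument of `torch.distributed.init_process_group`. Whether a longer timeout would help here is not established in this thread.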
We have not seen this error during our training runs. Could you try smaller or different models first? Are you using the latest version of DeepSpeed? Which GPU and CUDA version are you using? Do you have access to a different machine on which you could cross-check?
@andreaskoepf
Yes, I am on the latest version (at least as of last week), and the same goes for DeepSpeed.
I am using 8x A100 (80 GB) with CUDA 11.7.
I tried reducing the pretraining datasets (keeping only alpaca_gpt4) and the run succeeded, so I don't think the model is the cause.
I am trying to run pretraining of LLaMA 30B. Here is my run command: