RuntimeError: Timed out initializing process group in store based barrier on rank 2 #3626

SingL3 · 2023-08-02T02:43:33Z

I am trying to pretrain LLaMA 30B. Here is the command I am running:

deepspeed trainer_sft.py --configs defaults llama-30b-pretrain pretrain --cache_dir $DATA_PATH --output_dir $MODEL_PATH/llama-30b-pre --deepspeed

After the model was loaded, the run hung for a long time (I think about 30 minutes, since PyTorch's default timeout is 30 minutes).
Then this error was raised:

RuntimeError: Timed out initializing process group in store based barrier on rank 2 # for all rank

Any solutions?


andreaskoepf · 2023-08-08T09:04:14Z

We have not seen this error during our training runs. Could you try smaller or different models first? Are you using the latest version of DeepSpeed? Which GPU and CUDA version are you using? Do you have access to a different machine on which you could cross-check?

SingL3 · 2023-08-08T11:37:17Z

@andreaskoepf
Yes, I am on the latest versions as of last week, DeepSpeed included.
I am using 8x A100 (80 GB) with CUDA 11.7.
I tried reducing the pretraining datasets (keeping only alpaca_gpt4) and the run succeeds, so I don't think the model is the cause.
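
One possible reading of the observation above is that, with the full dataset mix, some ranks take longer to get ready than the 30 minute process-group timeout allows, so the store-based barrier expires. This has not been confirmed in this thread. A minimal sketch of raising that timeout with plain PyTorch follows. The environment variable `PG_TIMEOUT_MINUTES` and both helper functions are hypothetical names made up for illustration, and how `trainer_sft.py` actually initializes distributed training is not shown here.

```python
import os
from datetime import timedelta


def pg_timeout(default_minutes: float = 30.0) -> timedelta:
    """Build the process-group timeout.

    PG_TIMEOUT_MINUTES is a hypothetical variable used only in this sketch;
    neither PyTorch nor DeepSpeed reads it.
    """
    minutes = float(os.environ.get("PG_TIMEOUT_MINUTES", default_minutes))
    return timedelta(minutes=minutes)


def init_with_longer_timeout() -> None:
    """Initialize the default process group with a custom timeout.

    Assumes the launcher (for example the deepspeed CLI) has already set
    RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for this process.
    """
    import torch.distributed as dist  # imported lazily; requires PyTorch

    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", timeout=pg_timeout())
```

For example, setting `PG_TIMEOUT_MINUTES=120` before launching would give the ranks two hours to reach the barrier. If the training script builds Hugging Face `TrainingArguments`, its `ddp_timeout` field (in seconds) may be the more direct place to change this, but whether this script exposes it is an assumption.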

olliestanley added the ml label Aug 6, 2023
