A small question about trying to use distributed training with ColossalAI-Examples #295
-
Hi, this seems unrelated to the launch command. Could you provide your environment specifications (CUDA, NCCL, cuDNN, PyTorch versions) for us to check?
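For reference, something like this prints the relevant versions (a sketch, assuming a standard PyTorch install with CUDA):

```bash
# Gather the version info mentioned above (adjust to your setup).
python -c "import torch; print('torch:', torch.__version__)"
python -c "import torch; print('cuda:', torch.version.cuda)"
python -c "import torch; print('cudnn:', torch.backends.cudnn.version())"
python -c "import torch; print('nccl:', torch.cuda.nccl.version())"
nvcc --version   # system CUDA toolkit, if installed
```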
-
As the log says, no network interface was found. Can you run `ifconfig` to see what interfaces are present?
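For example (a sketch; `eth0` below is a placeholder, substitute whichever interface carries the 172.16.220.x network):

```bash
# List the available network interfaces.
ifconfig        # or: ip addr show

# If the wrong interface is being picked up automatically, point
# NCCL/Gloo at the right one explicitly (eth0 is an assumption):
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
```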
-
According to the PyTorch docs, try launching with the `torch.distributed.launch` module; it will automatically initialize the environment variables for you. On the second node, run the same command with the node rank changed (see the sketch below).
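A sketch of the two commands, assuming one GPU per node, machine 1 (172.16.220.18) as the master, and an arbitrary free port 29500; adjust `--nproc_per_node` and the port to your setup:

```bash
# On machine 1 (172.16.220.18):
python -m torch.distributed.launch \
    --nproc_per_node=1 --nnodes=2 --node_rank=0 \
    --master_addr=172.16.220.18 --master_port=29500 \
    train_gpt.py    # plus whatever arguments train_gpt.py takes

# On machine 2 (172.16.220.21), only --node_rank changes:
python -m torch.distributed.launch \
    --nproc_per_node=1 --nnodes=2 --node_rank=1 \
    --master_addr=172.16.220.18 --master_port=29500 \
    train_gpt.py
```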
-
According to your screenshot, the error occurs in the …
Another issue is that you set …
-
This is my original startup shell script, and it works fine.
I tried to adapt it for distributed training on two machines.
- machine 1 IP: 172.16.220.18
- machine 2 IP: 172.16.220.21
This is the shell script that I adapted.
I noticed that Colossal-AI's initialization reads from environment variables, so I added the following to `train_gpt.py`.
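It was roughly the following (I'm paraphrasing rather than pasting the exact snippet; the port is one I picked, and `WORLD_SIZE=2` assumes one process per machine):

```python
import os

# Standard torch.distributed environment variables (sketch; the
# values shown are for machine 1 -- on machine 2, RANK would be '1').
os.environ['MASTER_ADDR'] = '172.16.220.18'
os.environ['MASTER_PORT'] = '29500'   # arbitrary free port
os.environ['WORLD_SIZE'] = '2'        # one process per machine
os.environ['RANK'] = '0'
os.environ['LOCAL_RANK'] = '0'
```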
I don't know whether I've configured it correctly, but the following error occurred.
Any help would be much appreciated!