A small question about trying using distributed training to the ColossalAI-Examples #295

xxw11 · 2022-03-02T10:50:20Z

xxw11
Mar 2, 2022

This is my original startup shellcode and it works fine.

I tried to adapt it for distributed training on two machines.
machine 1 ip 172.16.220.18
machine 2 ip 172.16.220.21

This is shellcode that I adapted.

I noticed that the initialization of Colossal-ai was read from environment variables, so I added the following to the "train_gpt.py".

I don't know whether I'm configured correctly, but the following error occurred.

Any help would be much appreciated!

FrankLeeeee · 2022-03-02T11:59:59Z

FrankLeeeee
Mar 2, 2022

Hi, this seems irrelevant to the launch command. Could you provide your environment specufications (cuda, nccl, cudnn, pytorch versions) for us to check？

0 replies

FrankLeeeee · 2022-03-02T12:01:49Z

FrankLeeeee
Mar 2, 2022

As the log says there is no network interface found. Can you try ifconfig to see what interfaces are present？

1 reply

xxw11 Mar 2, 2022
Author

There are many results of ifconfig. Can I intercept them like this?
machine2
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.220.21 netmask 255.255.255.0 broadcast 172.16.220.255
inet6 2001:250:6806:172:9:599:0:21 prefixlen 128 scopeid 0x0
inet6 fe80::5173:75d5:2010:8f26 prefixlen 64 scopeid 0x20
ether ac:1f:6b:a1:1a:f6 txqueuelen 1000 (Ethernet)
RX packets 9624993 bytes 10465070578 (10.4 GB)
RX errors 0 dropped 618 overruns 0 frame 0
TX packets 8116232 bytes 4012502111 (4.0 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
machine1
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.220.18 netmask 255.255.255.0 broadcast 172.16.220.255
inet6 fe80::375a:cc1e:21d6:6a6b prefixlen 64 scopeid 0x20
inet6 2001:250:6806:172:9:599:0:27 prefixlen 128 scopeid 0x0
ether 3c:ec:ef:b1:74:ea txqueuelen 1000 (Ethernet)
RX packets 22598576 bytes 12299998071 (12.2 GB)
RX errors 0 dropped 1742 overruns 0 frame 0
TX packets 5259705 bytes 873786785 (873.7 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

kurisusnowdeng · 2022-03-02T13:54:48Z

kurisusnowdeng
Mar 2, 2022
Maintainer

According to the pytorch docs, try use torchrun with multiple nodes in the following way:

torchrun --nproc_per_node=NUM_GPUS_PER_NODE \
         --nnodes=NUM_NODES \
         --node_rank=NODE_RANK \
         --master_addr=MASTER_IP_ADDRESS \
         --master_port=MASTER_PORT \
         YOUR_SCRIPT.py --ARGUMENTS_OF_YOUR_SCRIPT

The pytorch module will automatically initialize the environment variables for you.
E.g. on the first node, run

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
         --master_addr=172.16.220.18 --master_port=12306 \
         train_gpt.py --config=../gpt2_configs/gpt2_2d.py --from_torch

On the second node, run

torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
         --master_addr=172.16.220.18 --master_port=12306 \
         train_gpt.py --config=../gpt2_configs/gpt2_2d.py --from_torch

0 replies

FrankLeeeee · 2022-03-02T14:25:33Z

FrankLeeeee
Mar 2, 2022

According to your screenshot, the error occurs in the colossalai.initialize call instead of colossalai.launch. Thus, I would assume you should be able to see a log message like this Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 4. Please make sure that this message aligns with your total number of GPUs of two nodes. If this number does not align, one possible case is that you only launch your script on one node.

Another issue is that you set NCCL_SOCKET_IFNAME to enp, however, this network interface does not exist. You only have eno1 according to your ifconfig output, so change your code to `os.environ['NCCL_SOCKET_IFNAME'] = 'eno1' and give it a try.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A small question about trying using distributed training to the ColossalAI-Examples #295

{{title}}

Replies: 4 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

A small question about trying using distributed training to the ColossalAI-Examples #295

xxw11 Mar 2, 2022

Replies: 4 comments · 1 reply

FrankLeeeee Mar 2, 2022

FrankLeeeee Mar 2, 2022

xxw11 Mar 2, 2022 Author

kurisusnowdeng Mar 2, 2022 Maintainer

FrankLeeeee Mar 2, 2022

xxw11
Mar 2, 2022

Replies: 4 comments 1 reply

FrankLeeeee
Mar 2, 2022

FrankLeeeee
Mar 2, 2022

xxw11 Mar 2, 2022
Author

kurisusnowdeng
Mar 2, 2022
Maintainer

FrankLeeeee
Mar 2, 2022