Hi, I'm in the process of building a multi-node LLM serving environment with vLLM, but I ran into a problem with the following test script (test2.py):

```python
import logging

import torch
import torch.distributed as dist
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("Multi-Node")

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
world_size = dist.get_world_size()

# PyNcclCommunicator is built on top of a CPU (gloo) process group
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)

# Run the all_reduce on a dedicated CUDA stream and inspect the result
s = torch.cuda.Stream()
data = torch.FloatTensor([1] * 128).to("cuda")
with torch.cuda.stream(s):
    logger.debug(f"Rank {local_rank}, data before all_reduce: {data}")
    pynccl.all_reduce(data, stream=s)
    logger.debug(f"Rank {local_rank}, data after all_reduce: {data}")
value = data.mean().item()
```

All scripts run inside Docker, and I started the containers as follows:

```bash
# Master
$ docker run -d \
--name node \
--entrypoint /bin/bash \
--network host \
--ipc host \
--gpus '"device=2,3"' \
-v ./:/vllm-workspace \
-e GLOO_SOCKET_IFNAME=eno3 \
-e NCCL_SOCKET_IFNAME=eno3 \
-e OMP_NUM_THREADS=4 \
vllm/vllm-openai:v0.6.4 \
-c "tail -f /dev/null"
# Worker
$ docker run -d \
--name node \
--entrypoint /bin/bash \
--network host \
--ipc host \
--gpus '"device=2,3"' \
-v ./:/vllm-workspace \
-e GLOO_SOCKET_IFNAME=eno3 \
-e NCCL_SOCKET_IFNAME=eno3 \
-e OMP_NUM_THREADS=4 \
vllm/vllm-openai:v0.6.4 \
-c "tail -f /dev/null" # Master
$ NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --rdzv_backend=c10d --rdzv_endpoint="192.168.74.162:29500" test2.py
# Worker
$ NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --rdzv_backend=c10d --rdzv_endpoint="192.168.74.162:29500" test2.py
```

Important: I expected every element of `data` to change from 1 to `world_size` (4) between the "before" and "after" logs, but the values are unchanged:

```
[12/19/24 17:35:45] DEBUG [Multi-Node] Rank 1, data before all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:23
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:1')
DEBUG [Multi-Node] Rank 1, data after all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:25
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:1')
[12/19/24 17:35:45] DEBUG [Multi-Node] Rank 0, data before all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:23
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:0')
DEBUG [Multi-Node] Rank 0, data after all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:25
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:0')
```
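For reference, here is a stripped-down sketch of a plain NCCL all_reduce that goes only through the standard torch.distributed API, without PyNcclCommunicator. As the "Full logs (official script)" output further down in this thread shows ("PyTorch NCCL is successful!"), this kind of collective does succeed in my setup, so basic NCCL connectivity between the nodes looks fine:

```python
# Plain NCCL all_reduce through torch.distributed, without PyNcclCommunicator.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

x = torch.ones(128, device=f"cuda:{local_rank}")
dist.all_reduce(x)  # default reduce op is SUM over all ranks
assert x.mean().item() == dist.get_world_size()

dist.destroy_process_group()
```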
Replies: 2 comments
Reply 1:
Full logs (my script):

```
[W1219 18:17:11.408663617 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.74.162]:29500 (errno: 97 - Address family not supported by protocol).
[W1219 18:17:24.992263295 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.74.162]:29500 (errno: 97 - Address family not supported by protocol).
[W1219 18:17:24.003061514 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.74.162]:29500 (errno: 97 - Address family not supported by protocol).
INFO 12-19 18:17:49 utils.py:960] Found nccl from library libnccl.so.2
INFO 12-19 18:17:49 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 12-19 18:17:49 utils.py:960] Found nccl from library libnccl.so.2
INFO 12-19 18:17:49 pynccl.py:69] vLLM is using nccl==2.21.5
mncsvr05:1158:1158 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:1158:1158 [0] NCCL INFO Bootstrap : Using eno3:192.168.74.162<0>
mncsvr05:1158:1158 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mncsvr05:1158:1158 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mncsvr05:1158:1158 [0] NCCL INFO NET/Plugin: Using internal network plugin.
mncsvr05:1158:1158 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
mncsvr05:1159:1159 [1] NCCL INFO cudaDriverVersion 12040
mncsvr05:1159:1159 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:1159:1159 [1] NCCL INFO Bootstrap : Using eno3:192.168.74.162<0>
mncsvr05:1159:1159 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mncsvr05:1159:1159 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mncsvr05:1159:1159 [1] NCCL INFO NET/Plugin: Using internal network plugin.
mncsvr05:1159:1159 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
mncsvr05:1159:1159 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:1159:1159 [1] NCCL INFO NET/Socket : Using [0]eno3:192.168.74.162<0>
mncsvr05:1159:1159 [1] NCCL INFO Using non-device net plugin version 0
mncsvr05:1159:1159 [1] NCCL INFO Using network Socket
mncsvr05:1158:1158 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
mncsvr05:1158:1158 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:1158:1158 [0] NCCL INFO NET/Socket : Using [0]eno3:192.168.74.162<0>
mncsvr05:1158:1158 [0] NCCL INFO Using non-device net plugin version 0
mncsvr05:1158:1158 [0] NCCL INFO Using network Socket
mncsvr05:1158:1158 [0] NCCL INFO ncclCommInitRank comm 0xd055800 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 86000 commId 0x54eb235e416158cc - Init START
mncsvr05:1159:1159 [1] NCCL INFO ncclCommInitRank comm 0xc6f5240 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId af000 commId 0x54eb235e416158cc - Init START
mncsvr05:1158:1158 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
mncsvr05:1158:1158 [0] NCCL INFO Setting affinity for GPU 0 to 1c,38e3c000,0001c38e,3c000000
mncsvr05:1159:1159 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
mncsvr05:1159:1159 [1] NCCL INFO Setting affinity for GPU 1 to 1c,38e3c000,0001c38e,3c000000
mncsvr05:1159:1159 [1] NCCL INFO comm 0xc6f5240 rank 1 nRanks 4 nNodes 2 localRanks 2 localRank 1 MNNVL 0
mncsvr05:1159:1159 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
mncsvr05:1159:1159 [1] NCCL INFO P2P Chunksize set to 131072
mncsvr05:1158:1158 [0] NCCL INFO comm 0xd055800 rank 0 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
mncsvr05:1158:1158 [0] NCCL INFO Channel 00/02 : 0 1 2 3
mncsvr05:1158:1158 [0] NCCL INFO Channel 01/02 : 0 1 2 3
mncsvr05:1158:1158 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
mncsvr05:1158:1158 [0] NCCL INFO P2P Chunksize set to 131072
mncsvr05:1158:1158 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
mncsvr05:1158:1158 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
mncsvr05:1158:1158 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
mncsvr05:1158:1158 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
mncsvr05:1159:1159 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
mncsvr05:1159:1159 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
mncsvr05:1158:1158 [0] NCCL INFO Connected all rings
mncsvr05:1158:1158 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
mncsvr05:1158:1158 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
mncsvr05:1158:1158 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
mncsvr05:1158:1158 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
mncsvr05:1159:1159 [1] NCCL INFO Connected all rings
mncsvr05:1159:1159 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
mncsvr05:1159:1159 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
mncsvr05:1158:1158 [0] NCCL INFO Connected all trees
mncsvr05:1159:1159 [1] NCCL INFO Connected all trees
mncsvr05:1159:1159 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mncsvr05:1159:1159 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mncsvr05:1158:1158 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mncsvr05:1158:1158 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mncsvr05:1158:1158 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mncsvr05:1159:1159 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mncsvr05:1158:1158 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mncsvr05:1159:1159 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mncsvr05:1158:1158 [0] NCCL INFO ncclCommInitRank comm 0xd055800 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 86000 commId 0x54eb235e416158cc - Init COMPLETE
mncsvr05:1159:1159 [1] NCCL INFO ncclCommInitRank comm 0xc6f5240 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId af000 commId 0x54eb235e416158cc - Init COMPLETE
[12/19/24 18:17:49] DEBUG [Multi-Node] Rank 0, data before all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:23
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:0')
[12/19/24 18:17:49] DEBUG [Multi-Node] Rank 1, data before all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:23
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:1')
DEBUG [Multi-Node] Rank 0, data after all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:25
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:0')
DEBUG [Multi-Node] Rank 1, data after all_reduce: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., test2.py:25
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1.], device='cuda:1')
```

Full logs (official script):

```
[W1219 21:17:42.749893102 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.74.162]:29500 (errno: 97 - Address family not supported by protocol).
[W1219 21:17:45.386609151 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.74.162]:29500 (errno: 97 - Address family not supported by protocol).
[W1219 21:17:45.407129627 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.74.162]:29500 (errno: 97 - Address family not supported by protocol).
mncsvr05:10083:10083 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:10083:10083 [0] NCCL INFO Bootstrap : Using eno3:192.168.74.162<0>
mncsvr05:10083:10083 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mncsvr05:10083:10083 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mncsvr05:10083:10083 [0] NCCL INFO NET/Plugin: Using internal network plugin.
mncsvr05:10083:10083 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
mncsvr05:10084:10084 [1] NCCL INFO cudaDriverVersion 12040
mncsvr05:10084:10084 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:10084:10084 [1] NCCL INFO Bootstrap : Using eno3:192.168.74.162<0>
mncsvr05:10084:10084 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mncsvr05:10084:10084 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mncsvr05:10084:10084 [1] NCCL INFO NET/Plugin: Using internal network plugin.
mncsvr05:10083:10100 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:10083:10100 [0] NCCL INFO NET/IB : No device found.
mncsvr05:10083:10100 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:10083:10100 [0] NCCL INFO NET/Socket : Using [0]eno3:192.168.74.162<0>
mncsvr05:10083:10100 [0] NCCL INFO Using non-device net plugin version 0
mncsvr05:10083:10100 [0] NCCL INFO Using network Socket
mncsvr05:10084:10101 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:10084:10101 [1] NCCL INFO NET/IB : No device found.
mncsvr05:10084:10101 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno3
mncsvr05:10084:10101 [1] NCCL INFO NET/Socket : Using [0]eno3:192.168.74.162<0>
mncsvr05:10084:10101 [1] NCCL INFO Using non-device net plugin version 0
mncsvr05:10084:10101 [1] NCCL INFO Using network Socket
mncsvr05:10083:10100 [0] NCCL INFO ncclCommInitRank comm 0x789d510 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 86000 commId 0x4cb598ba902c6d03 - Init START
mncsvr05:10084:10101 [1] NCCL INFO ncclCommInitRank comm 0x8f22db0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId af000 commId 0x4cb598ba902c6d03 - Init START
mncsvr05:10084:10101 [1] NCCL INFO Setting affinity for GPU 1 to 1c,38e3c000,0001c38e,3c000000
mncsvr05:10083:10100 [0] NCCL INFO Setting affinity for GPU 0 to 1c,38e3c000,0001c38e,3c000000
mncsvr05:10084:10101 [1] NCCL INFO comm 0x8f22db0 rank 1 nRanks 4 nNodes 2 localRanks 2 localRank 1 MNNVL 0
mncsvr05:10084:10101 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
mncsvr05:10083:10100 [0] NCCL INFO comm 0x789d510 rank 0 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
mncsvr05:10084:10101 [1] NCCL INFO P2P Chunksize set to 131072
mncsvr05:10083:10100 [0] NCCL INFO Channel 00/02 : 0 1 2 3
mncsvr05:10083:10100 [0] NCCL INFO Channel 01/02 : 0 1 2 3
mncsvr05:10083:10100 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
mncsvr05:10083:10100 [0] NCCL INFO P2P Chunksize set to 131072
mncsvr05:10083:10100 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10100 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10100 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
mncsvr05:10083:10100 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
mncsvr05:10084:10101 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
mncsvr05:10084:10101 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
mncsvr05:10084:10101 [1] NCCL INFO Connected all rings
mncsvr05:10084:10101 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
mncsvr05:10083:10100 [0] NCCL INFO Connected all rings
mncsvr05:10084:10101 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
mncsvr05:10083:10100 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10100 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10100 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
mncsvr05:10083:10100 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
mncsvr05:10083:10100 [0] NCCL INFO Connected all trees
mncsvr05:10084:10101 [1] NCCL INFO Connected all trees
mncsvr05:10084:10101 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mncsvr05:10084:10101 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mncsvr05:10083:10100 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mncsvr05:10083:10100 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mncsvr05:10083:10100 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mncsvr05:10084:10101 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mncsvr05:10083:10100 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mncsvr05:10084:10101 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mncsvr05:10083:10100 [0] NCCL INFO ncclCommInitRank comm 0x789d510 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 86000 commId 0x4cb598ba902c6d03 - Init COMPLETE
mncsvr05:10084:10101 [1] NCCL INFO ncclCommInitRank comm 0x8f22db0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId af000 commId 0x4cb598ba902c6d03 - Init COMPLETE
PyTorch NCCL is successful!PyTorch NCCL is successful!
PyTorch GLOO is successful!PyTorch GLOO is successful!
INFO 12-19 21:17:49 utils.py:960] Found nccl from library libnccl.so.2
INFO 12-19 21:17:49 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 12-19 21:17:49 utils.py:960] Found nccl from library libnccl.so.2
INFO 12-19 21:17:49 pynccl.py:69] vLLM is using nccl==2.21.5
mncsvr05:10083:10083 [0] NCCL INFO Using non-device net plugin version 0
mncsvr05:10083:10083 [0] NCCL INFO Using network Socket
mncsvr05:10084:10084 [1] NCCL INFO Using non-device net plugin version 0
mncsvr05:10084:10084 [1] NCCL INFO Using network Socket
mncsvr05:10083:10083 [0] NCCL INFO ncclCommInitRank comm 0xadce820 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 86000 commId 0x426866d17ac8fa50 - Init START
mncsvr05:10084:10084 [1] NCCL INFO ncclCommInitRank comm 0xc476720 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId af000 commId 0x426866d17ac8fa50 - Init START
mncsvr05:10083:10083 [0] NCCL INFO Setting affinity for GPU 0 to 1c,38e3c000,0001c38e,3c000000
mncsvr05:10084:10084 [1] NCCL INFO Setting affinity for GPU 1 to 1c,38e3c000,0001c38e,3c000000
mncsvr05:10084:10084 [1] NCCL INFO comm 0xc476720 rank 1 nRanks 4 nNodes 2 localRanks 2 localRank 1 MNNVL 0
mncsvr05:10084:10084 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
mncsvr05:10084:10084 [1] NCCL INFO P2P Chunksize set to 131072
mncsvr05:10083:10083 [0] NCCL INFO comm 0xadce820 rank 0 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
mncsvr05:10083:10083 [0] NCCL INFO Channel 00/02 : 0 1 2 3
mncsvr05:10083:10083 [0] NCCL INFO Channel 01/02 : 0 1 2 3
mncsvr05:10083:10083 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
mncsvr05:10083:10083 [0] NCCL INFO P2P Chunksize set to 131072
mncsvr05:10083:10083 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10083 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10083 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
mncsvr05:10083:10083 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
mncsvr05:10084:10084 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
mncsvr05:10084:10084 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
mncsvr05:10084:10084 [1] NCCL INFO Connected all rings
mncsvr05:10083:10083 [0] NCCL INFO Connected all rings
mncsvr05:10084:10084 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
mncsvr05:10084:10084 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
mncsvr05:10083:10083 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10083 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
mncsvr05:10083:10083 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/Socket/0
mncsvr05:10083:10083 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
mncsvr05:10083:10083 [0] NCCL INFO Connected all trees
mncsvr05:10084:10084 [1] NCCL INFO Connected all trees
mncsvr05:10084:10084 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mncsvr05:10084:10084 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mncsvr05:10083:10083 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
mncsvr05:10083:10083 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mncsvr05:10083:10083 [0] NCCL INFO ncclCommInitRank comm 0xadce820 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 86000 commId 0x426866d17ac8fa50 - Init COMPLETE
mncsvr05:10084:10084 [1] NCCL INFO ncclCommInitRank comm 0xc476720 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId af000 commId 0x426866d17ac8fa50 - Init COMPLETE
[rank0]: Traceback (most recent call last):
[rank0]: File "/vllm-workspace/test.py", line 38, in <module>
[rank0]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: Expected 4, got 1.0
[rank1]: Traceback (most recent call last):
[rank1]: File "/vllm-workspace/test.py", line 38, in <module>
[rank1]: assert value == world_size, f"Expected {world_size}, got {value}"
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError: Expected 4, got 1.0
[rank0]:[W1219 21:17:49.704183432 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
mncsvr05:10083:10102 [0] NCCL INFO [Service thread] Connection closed by localRank 0
mncsvr05:10084:10103 [1] NCCL INFO [Service thread] Connection closed by localRank 1
mncsvr05:10083:10131 [0] NCCL INFO comm 0x789d510 rank 0 nranks 4 cudaDev 0 busId 86000 - Abort COMPLETE
mncsvr05:10084:10132 [1] NCCL INFO comm 0x8f22db0 rank 1 nranks 4 cudaDev 1 busId af000 - Abort COMPLETE
W1219 21:17:50.923000 10077 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10084 closing signal SIGTERM
E1219 21:17:51.088000 10077 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 10083) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-19_21:17:50
host : mncsvr05
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10083)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
Reply 2:
```
$ diff 0.6.4.py latest.py
32d31
< pynccl.disabled = False
```

In other words, the sanity-check script that matches vLLM 0.6.4 sets `pynccl.disabled = False` before calling `all_reduce`; that line is not present in the latest version of the script.
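If that is the cause, the script from the question would need the same line when running against vLLM 0.6.4. A rough sketch of the adjusted call site (reusing `pynccl`, `s`, and `world_size` from the question's script; not verified beyond what the diff above shows):

```python
# On vLLM 0.6.4 the PyNcclCommunicator appears to start out disabled, so its
# all_reduce is a silent no-op until it is enabled (this is what the diff suggests).
pynccl.disabled = False

data = torch.FloatTensor([1] * 128).to("cuda")
with torch.cuda.stream(s):
    pynccl.all_reduce(data, stream=s)
torch.cuda.synchronize()

value = data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"
```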