Multi-gpu NCCL test test_all_reduce_coalesced_nccl failing #35

Open
ajindal1 opened this issue Jul 18, 2022 · 3 comments

Comments

@ajindal1
Contributor

While running the multi-GPU PyTorch tests, test_all_reduce_coalesced_nccl fails in pytorch/test/test_c10d_nccl.py. The error appears to be caused by inconsistent results from allreduce. The relevant log output is as follows:
171495ffc000000:237471:237471 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237471:237471 [0] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237471:237471 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO Using network Socket
NCCL version 2.12.12.MSCCL.0.1+cuda11.3
171495ffc000000:237471:237535 [0] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x203400000
171495ffc000000:237472:237472 [1] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237472:237472 [1] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237472:237472 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO Using network Socket
171495ffc000000:237472:237536 [1] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x206800000
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===

Additional error info:
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 601, in run_test
getattr(self, test_name)()
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 486, in wrapper
fn()
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 3098, in wrapper
return func(*args, **kwargs)
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 131, in wrapper
return func(*args, **kwargs)
File "/mnt/vss/_work/1/s/test/pytorch/test/distributed/test_c10d_nccl.py", line 2867, in test_all_reduce_coalesced_nccl
self.assertEqual(t, torch.full_like(t, self.world_size * (i + (self.world_size + 1.) / 2.)))
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
assert_equal(
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_comparison.py", line 1080, in assert_equal
raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!
Mismatched elements: 60 / 60 (100.0%)
Greatest absolute difference: 1.0 at index 0 (up to 1e-05 allowed)
Greatest relative difference: 0.3333333333333333 at index 0 (up to 1.3e-06 allowed)

exiting process 1 with exit code: 10
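
For reference, a minimal sketch of the arithmetic behind the assertion on line 2867 of test_c10d_nccl.py. The per-rank fill value (rank + 1 + i) and the number of coalesced tensors are my reading of the test, so treat them as assumptions; the expected-value formula itself is the one quoted in the traceback above.

# Sketch: expected values checked by the assertion, assuming each rank fills
# tensor i with (rank + 1 + i) before the coalesced allreduce.
world_size = 2  # two GPUs in this run

for i in range(5):  # tensor count is an assumption for illustration
    per_rank = [rank + 1 + i for rank in range(world_size)]
    expected = world_size * (i + (world_size + 1.0) / 2.0)  # formula from the assert
    assert sum(per_rank) == expected
    print(f"tensor {i}: per-rank values {per_rank}, expected sum {expected}")

# For i == 0 the expected value is 3.0; a greatest absolute difference of 1.0
# and relative difference of 1/3 would correspond to an observed value of 2.0,
# which would be consistent with one rank's contribution missing from the sum.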

@saeedmaleki
Contributor

Thanks @ajindal1 for reporting this! This commit should fix it:
cb4c0c7
Please confirm it and then close the issue.

@saeedmaleki
Contributor

Hi @ajindal1, is this addressed now?

@ajindal1
Contributor Author

ajindal1 commented Aug 2, 2022

I believe this error was fixed, but another error still remains that might be related to the hardware capacity of the machine.
Logs:
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
File "/mnt/vss/_work/1/s/test/pytorch/test/distributed/test_c10d_nccl.py", line 996, in test_nccl_propagate_error_reason
pg.allreduce([torch.ones(2).cuda(self.rank)]).wait()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from memcpy_and_sync at /mnt/vss/_work/1/s/test/pytorch/c10/cuda/CUDAFunctions.h:75 (most recent call first):
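
In case it helps narrow this down, a minimal sketch of forcing synchronous kernel launches so the reported stack trace points at the call that actually failed. The environment variable is the one suggested in the error message; setting it from Python before CUDA initializes is just one way to do it.

# Sketch: make CUDA kernel launches synchronous so errors surface at the
# offending call rather than at a later, unrelated API call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch
import torch.distributed as dist
# ... initialize the NCCL process group and re-run the failing allreduce here ...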
