When I run the multi-GPU PyTorch tests, test_all_reduce_coalesced_nccl fails in pytorch/test/distributed/test_c10d_nccl.py. The failure appears to be caused by inconsistent results from allreduce. The logs show the following:
171495ffc000000:237471:237471 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237471:237471 [0] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237471:237471 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO Using network Socket
NCCL version 2.12.12.MSCCL.0.1+cuda11.3
171495ffc000000:237471:237535 [0] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x203400000
171495ffc000000:237472:237472 [1] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237472:237472 [1] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237472:237472 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO Using network Socket
171495ffc000000:237472:237536 [1] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x206800000
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===
Additional error info:
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 601, in run_test
getattr(self, test_name)()
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 486, in wrapper
fn()
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 3098, in wrapper
return func(*args, **kwargs)
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 131, in wrapper
return func(*args, **kwargs)
File "/mnt/vss/_work/1/s/test/pytorch/test/distributed/test_c10d_nccl.py", line 2867, in test_all_reduce_coalesced_nccl
self.assertEqual(t, torch.full_like(t, self.world_size * (i + (self.world_size + 1.) / 2.)))
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
assert_equal(
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_comparison.py", line 1080, in assert_equal
raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!
Mismatched elements: 60 / 60 (100.0%)
Greatest absolute difference: 1.0 at index 0 (up to 1e-05 allowed)
Greatest relative difference: 0.3333333333333333 at index 0 (up to 1.3e-06 allowed)
exiting process 1 with exit code: 10
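To make the numbers in the assertion easier to read, here is a minimal sketch of the arithmetic behind the expected value. It assumes each rank contributes a tensor filled with `rank + 1 + i` (the test's input construction is not shown in the traceback, so that is an assumption), and the helper names are hypothetical. With two ranks and `i = 0` the expected sum is 3.0, so the reported absolute difference of 1.0 corresponds to the relative difference of 0.333 in the log.

```python
# Sketch of the expected-value arithmetic in the failing assertion.
# ASSUMPTION: rank r contributes a tensor filled with (r + 1 + i);
# the helper names below are hypothetical, not part of PyTorch.

def expected_allreduce_value(world_size: int, i: int) -> float:
    # Same expression as in the assertion from the traceback.
    return world_size * (i + (world_size + 1.0) / 2.0)

def simulated_sum(world_size: int, i: int) -> float:
    # What a correct sum-allreduce would produce under the assumption above.
    return float(sum(rank + 1 + i for rank in range(world_size)))

world_size = 2  # two ranks, [0] and [1], appear in the NCCL log
expected = expected_allreduce_value(world_size, 0)
abs_diff = 1.0  # "Greatest absolute difference" from the log
rel_diff = abs_diff / expected

print(expected, simulated_sum(world_size, 0), rel_diff)
```

Under that assumption, the observed values are off by exactly 1.0 from the expected 3.0 at index 0, which is consistent with the reduction producing a wrong sum rather than a small numerical error.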
I believe this error has been fixed, but another error remains, which might be related to the hardware capacity of the machine.
Logs:
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
File "/mnt/vss/_work/1/s/test/pytorch/test/distributed/test_c10d_nccl.py", line 996, in test_nccl_propagate_error_reason
pg.allreduce([torch.ones(2).cuda(self.rank)]).wait()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from memcpy_and_sync at /mnt/vss/_work/1/s/test/pytorch/c10/cuda/CUDAFunctions.h:75 (most recent call first):
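The error message suggests rerunning with CUDA_LAUNCH_BLOCKING=1. A minimal sketch of how one might rerun only the failing test with that variable set is shown below. The helper name `build_debug_run` is hypothetical, the relative test path assumes the working directory is the PyTorch checkout, and the `-k` filter assumes the test file forwards its arguments to unittest.

```python
import os
import subprocess
import sys

def build_debug_run(test_file, test_name, base_env=None):
    """Return (cmd, env) for rerunning one test with synchronous CUDA launches.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so the stack trace points at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Keep NCCL's INFO-level logging, as in the logs above.
    env["NCCL_DEBUG"] = "INFO"
    # ASSUMPTION: the test file accepts unittest's -k name filter.
    cmd = [sys.executable, test_file, "-k", test_name]
    return cmd, env

if __name__ == "__main__" and os.environ.get("RUN_NCCL_DEBUG_TEST") == "1":
    # Only runs when explicitly requested; needs a multi-GPU machine.
    cmd, env = build_debug_run(
        "test/distributed/test_c10d_nccl.py",
        "test_nccl_propagate_error_reason",
    )
    subprocess.run(cmd, env=env, check=False)
```

The `RUN_NCCL_DEBUG_TEST` guard is also an assumption of this sketch; it only keeps the subprocess from launching unless requested.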