Environment:
ucx: 1.12.1
ucc: 1.0.0
cuda: cuda11.7
gcc: gcc-9.4.0
pytorch: nightly
open mpi: 4.1.2rc4
The command start_test.sh torch_pt2pt_test.py --backend ucc deadlocks. After waiting a while, the following log shows up:
start_test.sh torch_pt2pt_test.py --backend ucc
Traceback (most recent call last):
  File "torch_pt2pt_test.py", line 32, in <module>
    dist.recv(tensor_test, src=0, tag=0, group=pg)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1007, in recv
    pg.recv([tensor], group_src_rank, tag).wait()
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "torch_pt2pt_test.py", line 28, in <module>
    dist.send(tensor_test, dst=dst + 1, tag=0, group=pg)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 964, in send
    group.send([tensor], group_dst_rank, tag).wait()
RuntimeError: Socket Timeout