I have a question about running the code. The command torchrun --nproc_per_node=2 scripts/sdxl_example.py fails with an error. My torch version is 2.2.1, CUDA version is 11.8, and Python version is 3.10.
#13 · Open · CharvinMei opened this issue on Jun 12, 2024 · 4 comments
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
what(): terminate called after throwing an instance of 'NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)std::runtime_error
'
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/sdxl_example.py FAILED
Failures:
[1]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 10454)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10454
Root Cause (first observed failure):
[0]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 10453)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10453
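The NCCL message in the log above asks for a rerun with NCCL_DEBUG=WARN to get details on the "invalid usage" error. A minimal sketch of such a rerun follows, assuming the same script path and process count as in the original command; the helper name build_debug_launch is hypothetical and not part of the repository or of PyTorch.

```python
import os
import shlex


def build_debug_launch(script="scripts/sdxl_example.py", nproc=2, level="WARN"):
    """Return (command, environment) for a torchrun launch with NCCL logging enabled.

    The helper itself is hypothetical; NCCL_DEBUG is the environment variable
    named in the error message, and the command mirrors the one in the report.
    """
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script]
    env = dict(os.environ)      # copy, so the current process environment is untouched
    env["NCCL_DEBUG"] = level   # WARN is the level the error message suggests
    return cmd, env


cmd, env = build_debug_launch()
# Show the equivalent shell invocation without launching anything.
print(f"NCCL_DEBUG={env['NCCL_DEBUG']} {shlex.join(cmd)}")
```

To actually launch, the pair can be passed to subprocess.run(cmd, env=env). The extra NCCL lines printed before the abort would likely narrow down which call NCCL considers invalid.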
After changing the setting, the error changed to the following:
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[2024-07-08 06:10:31,482] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 99442) of binary: /home/meichangwang/miniconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
File "/home/meichangwang/miniconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/sdxl_example.py FAILED
Failures:
[1]:
time : 2024-07-08_06:10:31
host : ubuntu
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 99443)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 99443
Root Cause (first observed failure):
[0]:
time : 2024-07-08_06:10:31
host : ubuntu
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 99442)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 99442