
I have a question about running the code. There is an error when running the command `torchrun --nproc_per_node=2 scripts/sdxl_example.py`. My torch version is 2.2.1, CUDA version is 11.8, and Python version is 3.10. #13

Open
CharvinMei opened this issue Jun 12, 2024 · 4 comments


@CharvinMei

```
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
  what(): terminate called after throwing an instance of 'NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)std::runtime_error
'
  what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
  time      : 2024-06-12_03:45:51
  host      : 692d3f5c0349
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 10454)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 10454

Root Cause (first observed failure):
[0]:
  time      : 2024-06-12_03:45:51
  host      : 692d3f5c0349
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 10453)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 10453
```
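The NCCL error itself points at the next diagnostic step. A sketch of rerunning the failing command with NCCL's debug logging enabled (`NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; the exact output depends on your NCCL build):

```shell
# Rerun with NCCL debug logging, as the error message suggests.
# WARN prints the failing call; INFO is more verbose if needed.
NCCL_DEBUG=WARN torchrun --nproc_per_node=2 scripts/sdxl_example.py

# For full detail, including topology and collective setup:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL torchrun --nproc_per_node=2 scripts/sdxl_example.py
```

The WARN-level line printed just before the abort usually names the NCCL call that returned `invalid usage`.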

@lmxyy (Collaborator) commented Jun 12, 2024

Looks like it's a torchrun/NCCL issue. Are you able to run it with a single GPU?

@CharvinMei (Author)

Yes, it runs fine on a single GPU, but the error occurs when it's set up for two GPUs.
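One way to isolate whether plain NCCL collectives work across the two GPUs, independent of DistriFuser, is a minimal `all_reduce` script (a sketch; `nccl_check.py` is a hypothetical filename, launched with the same `torchrun` invocation as the failing example):

```python
# Minimal two-GPU NCCL sanity check, independent of DistriFuser.
# Run with: torchrun --nproc_per_node=2 nccl_check.py
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so no extra args needed.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Each rank contributes 1.0; with two processes the sum should be 2.0.
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this also aborts with NCCL Error 5, the problem is in the NCCL/driver setup rather than in DistriFuser itself.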

@lmxyy (Collaborator) commented Jun 12, 2024

Weird. Could you try disabling CUDAGraph to see if it works? You can simply pass `use_cuda_graph=False` here.
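For reference, a sketch of what that change looks like in `scripts/sdxl_example.py`, assuming the script constructs a `DistriConfig` as in the repository's README examples (the other constructor arguments here are illustrative; adjust them to match your local copy):

```python
from distrifuser.pipelines import DistriSDXLPipeline
from distrifuser.utils import DistriConfig

# Build the config with CUDAGraph capture disabled, per the
# suggestion above, to rule out the graph-capture/NCCL interaction.
distri_config = DistriConfig(
    height=1024,
    width=1024,
    warmup_steps=4,
    use_cuda_graph=False,  # disable CUDAGraph capture
)

pipeline = DistriSDXLPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0",
)
```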

@CharvinMei (Author)

After changing that setting, the error became the following:

```
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[2024-07-08 06:10:31,482] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 99442) of binary: /home/meichangwang/miniconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/home/meichangwang/miniconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
  time      : 2024-07-08_06:10:31
  host      : ubuntu
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 99443)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 99443

Root Cause (first observed failure):
[0]:
  time      : 2024-07-08_06:10:31
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 99442)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 99442
```
