
Error when running distributed training with train_rotate_FB15K237_dist.py from openke2.0 #410

Open
pipiyapi opened this issue Jun 17, 2024 · 1 comment


@pipiyapi

Hello, I get the following error when running train_rotate_FB15K237_dist.py from openke2.0. Is there any way to fix this? Any help would be greatly appreciated.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
The total of train triples is 2849846.
The total of train triples is 2849846.
Input Files Path : ./benchmarks/data-390/
Input Files Path : ./benchmarks/data-390/
The total of test triples is 258713.
The total of valid triples is 1293564.
The total of test triples is 258713.
The total of valid triples is 1293564.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2646564) of binary: /home/jupyter-xingcheng/.conda/envs/openke/bin/python3.8
Traceback (most recent call last):
File "/home/jupyter-xingcheng/.conda/envs/openke/bin/torchrun", line 8, in
sys.exit(main())
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_rotate_data_390_dist.py FAILED

Failures:
[1]:
time : 2024-06-17_13:53:46
host : dell
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 2646565)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646565

Root Cause (first observed failure):
[0]:
time : 2024-06-17_13:53:46
host : dell
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 2646564)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646564

The command I ran was: WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port 1234 train_rotate_data_390_dist.py
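For anyone hitting the same exit code: a SIGSEGV (-11) during data import is often caused by entity or relation IDs in the benchmark files exceeding the declared counts, which makes the C++ backend index past its buffers. Below is a minimal validation sketch, assuming the standard OpenKE benchmark layout (first line of each file is the record count, then one whitespace-separated `head tail relation` triple per line); the file names are OpenKE's defaults, and `PATH` should point at your own benchmark directory:

```python
# check_benchmark.py -- hedged sketch: verify that every ID in the
# triple files stays within the entity/relation counts. Out-of-range
# IDs make OpenKE's C++ data loader read past its buffers, which
# surfaces as Signal 11 (SIGSEGV).
import os

PATH = "./benchmarks/data-390/"  # adjust to your benchmark directory

def count_of(filename):
    # First line of an OpenKE id file is the record count.
    with open(os.path.join(PATH, filename)) as f:
        return int(f.readline())

n_ent = count_of("entity2id.txt")
n_rel = count_of("relation2id.txt")

for triples in ("train2id.txt", "valid2id.txt", "test2id.txt"):
    with open(os.path.join(PATH, triples)) as f:
        f.readline()  # skip the triple count on the first line
        for lineno, line in enumerate(f, start=2):
            if not line.strip():
                continue
            h, t, r = map(int, line.split())
            if h >= n_ent or t >= n_ent or r >= n_rel:
                print(f"{triples}:{lineno}: out-of-range triple ({h}, {t}, {r})")
```

Running a check like this before training catches exactly the kind of data error that turned out to be the culprit here.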

@pipiyapi
Author

The problem above is solved; it was caused by an error in my data. But distributed training now hits a new problem: only one card seems to be doing the work, yet the other card's GPU is also maxed out.
(openke) jupyter-xingcheng@dell:~/OpenKE2.0$ python -m torch.distributed.launch --nproc_per_node 2 train_rotate_data_390_dist.py
/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
The total of train triples is 2849846.
The total of train triples is 2849846.
Input Files Path : ./benchmarks/data-390/
Input Files Path : ./benchmarks/data-390/
The total of test triples is 258712.
The total of valid triples is 1293564.
The total of test triples is 258712.
The total of valid triples is 1293564.
Finish initializing...
0%| | 0/6000 [00:00<?, ?it/s]Finish initializing...
Epoch 0 | loss: 1141.047029: 0%| | 1/6000 [03:04<307:32:46, 184.56s/it

Here is the nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:3B:00.0 Off | N/A |
| 70% 86C P2 297W / 350W | 22428MiB / 24576MiB | 89% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:D8:00.0 Off | N/A |
| 88% 88C P2 278W / 350W | 22428MiB / 24576MiB | 89% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3311491 C ...cheng/.conda/envs/openke/bin/python 22422MiB |
| 1 N/A N/A 3311492 C ...cheng/.conda/envs/openke/bin/python 22422MiB |
+-----------------------------------------------------------------------------------------+
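A common cause of this symptom (both GPUs allocated and busy, but only one progress bar advancing) is that each process is not pinned to its own device, or the model is not wrapped in DistributedDataParallel, so the ranks compute redundantly instead of sharing the work. Here is a minimal sketch of the torchrun-era setup, not OpenKE's actual code; `build_model` is a hypothetical placeholder for your RotatE construction:

```python
# Hedged sketch: per-process device pinning under torchrun. This is
# illustrative boilerplate, not train_rotate_data_390_dist.py itself.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_model():
    # Hypothetical placeholder for the RotatE model construction;
    # any torch.nn.Module works for illustrating the setup.
    return torch.nn.Linear(8, 8)

def main():
    # torchrun sets LOCAL_RANK/RANK/WORLD_SIZE in the environment;
    # the deprecated torch.distributed.launch instead passes a
    # --local_rank argument, as the FutureWarning above notes.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)          # pin this process to one GPU
    dist.init_process_group(backend="nccl")

    model = build_model().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # synchronize gradients

    # ... training loop: a DistributedSampler should shard the data so
    # each rank sees a distinct subset; without it, both ranks redo the
    # same work, which also looks like "two busy GPUs, one result".

if __name__ == "__main__":
    main()

```

Also worth checking: tqdm prints from every rank by default, so guarding the progress bar with `if local_rank == 0` avoids the interleaved output seen above and makes it easier to tell whether both ranks are really training.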
