Training with fpn_repvit_m1_1_ade20k_40k.py produces this error during validation:
mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
mmseg - INFO - Iter [50/40000] lr: 1.998e-04, eta: 1:35:52, time: 0.144, data_time: 0.003, memory: 8914, decode.loss_ce: 0.6784, decode.acc_seg: 83.5116, loss: 0.6784
mmseg - INFO - Iter [100/40000] lr: 1.996e-04, eta: 1:06:16, time: 0.055, data_time: 0.002, memory: 8914, decode.loss_ce: 0.3194, decode.acc_seg: 88.5569, loss: 0.3194
[>>>>>>> ] 149/1000, 19.7 task/s, elapsed: 8s, ETA: 43s
/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 22 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 31435) of binary: /home/hyq/anaconda3/envs/SCTNet/bin/python
Traceback (most recent call last):
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hyq/anaconda3/envs/SCTNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/media/hyq/西部数据2TB/RepViT/segmentation/tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-06-28_18:05:26
host : hyq-MS-7D36
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 31435)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 31435
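
For reference, the negative exit code in the report above is the signal number that terminated the worker. A minimal sketch of that mapping, using only Python's standard `signal` module; the helper name `describe_exit_code` is hypothetical and is not part of torch or mmseg. It only decodes the number, since the log itself does not say what sent the signal.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description.

    Hypothetical helper for illustration only; not a torch or mmseg API.
    """
    if exitcode < 0:
        # A negative code means the process was terminated by signal -exitcode.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The exit code reported for rank 0 in the log above.
    print(describe_exit_code(-9))
```

On Linux this reports SIGKILL for `-9`, which matches the `traceback` line of the failure summary.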