I ran into some problems when switching the backbone network #165

Open
zhenzi0322 opened this issue Jan 18, 2025 · 6 comments

Comments

@zhenzi0322

I switched the backbone to swin_v1_t in config.py and also changed self.size to (2240, 2240).

When training from scratch, the following error appeared at epoch 10:

```
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Any advice would be appreciated.
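For reference, a minimal sketch of the kind of config.py edit meant above; self.size and the value 'swin_v1_t' come from this issue, while the attribute name for the backbone (self.bb here) and the surrounding class layout are assumptions, not copied from the repository:

```python
# Hypothetical sketch of the config.py change; only self.size and 'swin_v1_t'
# come from this issue, the rest is assumed.
class Config:
    def __init__(self):
        self.bb = 'swin_v1_t'      # assumed attribute name for the backbone choice
        self.size = (2240, 2240)   # larger training resolution; GPU memory grows quadratically with side length
```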

@ZhengPeng7
Owner

You could try running it on CPU; CUDA error messages are often misleading. I suspect it's a GPU out-of-memory issue, so try reducing the size and see if it runs.
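A minimal sketch of that kind of check, assuming a generic PyTorch training script (model, batch, and criterion names are placeholders, not this repo's code): force synchronous CUDA launches so the traceback points at the failing op, or run one batch on CPU to get a readable Python exception.

```python
# Sketch only: all names below are placeholders, not the repo's actual code.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before torch initializes CUDA

import torch

def debug_one_batch_on_cpu(model, batch, criterion):
    # Running a single batch on CPU usually turns an asynchronous CUDA failure
    # into a readable Python exception raised at the exact failing operation.
    model = model.cpu()
    images, gts = batch
    preds = model(images.cpu())
    loss = criterion(preds, gts.cpu())
    loss.backward()
    return loss.item()
```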

@zhenzi0322
Author

I'm already using batch_size=1, so it can't be reduced any further.

@ZhengPeng7
Owner

I meant reducing the size, not the batch size.

@zhenzi0322
Author

I reduced the size as well. Now it throws this error:

```
operator(): block: [3508,0,0], thread: [100,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "/data/ssd/ai/xxx/train.py", line 250, in <module>
    main()
  File "/data/ssd/ai/xxx/train.py", line 236, in main
    train_loss = trainer.train_epoch(epoch)
  File "/data/ssd/ai/xxx/train.py", line 213, in train_epoch
    self._train_batch(batch)
  File "/data/ssd/ai/xxx/train.py", line 182, in _train_batch
    self.loss_dict['loss_pix'] = loss_pix.item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
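For context, this particular assertion (`input_val >= zero && input_val <= one`) comes from the CUDA kernel of PyTorch's binary_cross_entropy, which requires the predicted probabilities to lie in [0, 1]; NaN predictions also fail that comparison. A minimal, repo-independent reproduction:

```python
# Unrelated to the repo's code: a minimal reproduction of the same assertion.
# On CPU, PyTorch raises a regular Python error instead of a device-side assert.
import torch
import torch.nn.functional as F

pred = torch.tensor([1.2, 0.5])   # 1.2 is outside [0, 1]
gt = torch.tensor([1.0, 0.0])
F.binary_cross_entropy(pred, gt)  # raises a RuntimeError because 1.2 lies outside [0, 1]
```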

@zhenzi0322
Author

Could the loss suddenly become NaN partway through training, and that's what causes the problem I described above?
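One way to test that hypothesis is to guard the training step and stop as soon as the loss stops being finite; a minimal sketch where all names are placeholders, not this repo's code:

```python
# Sketch of a guard against non-finite losses; all names are placeholders.
import torch

def train_step(model, batch, criterion, optimizer, step):
    images, gts = batch
    preds = model(images)
    loss = criterion(preds, gts)
    # Stop as soon as the loss goes NaN/Inf, before the bad gradients reach the
    # weights and later forward passes start producing NaN predictions.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss {loss.item()} at step {step}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```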

@ZhengPeng7
Owner

Did this error not occur before you changed the backbone?
If it happened then too, it's actually unrelated to the backbone...
Generally, this problem means the GT values are not within [0, 1]; please check your data.
As a control, you could train on DIS5K, for example; if that trains without problems, it basically confirms the issue is the value range of your new data's annotations.
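A minimal sketch of that kind of data check, assuming a dataloader that yields (image, GT) pairs; that interface is an assumption about your pipeline, not this repo's code:

```python
# Sketch: verify the GT tensors actually produced by the data pipeline stay in [0, 1],
# since BCE-style losses assert on GPU when predictions or targets leave that range.
import torch

def check_gt_range(dataloader, max_batches=200):
    bad = []
    for i, (_, gts) in enumerate(dataloader):
        lo, hi = gts.min().item(), gts.max().item()
        if lo < 0.0 or hi > 1.0 or not torch.isfinite(gts).all():
            bad.append((i, lo, hi))
        if i + 1 >= max_batches:
            break
    return bad  # an empty list means every checked GT batch stayed inside [0, 1]
```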
