
loss=nan #5

Closed · Zerycii opened this issue Nov 22, 2024 · 17 comments
@Zerycii commented Nov 22, 2024


Thanks for your excellent work, but it seems I have run into a problem: I don't know why the loss is NaN. Adding normalization did not help. I would sincerely appreciate your help!

@Zerycii (Author) commented Nov 22, 2024

This is the screenshot (image attachment: f49804c6b3eaf247aaec13e81a3525c).

@nhduong (Owner) commented Nov 22, 2024

We've recently been informed of the same problem. In most cases, it is caused by training with AMP in FP16. Could you please change the following line
https://github.com/nhduong/guided_demoireing_net/blob/3eb8fb1c3dd45f4a9bf4ca38acbce91b48135672/default_config.yaml#L7C1-L7C16
to `mixed_precision: 'no'` and give it another try? We are sorry for the inconvenience.
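After editing the file, a quick way to confirm the change took effect is sketched below (this assumes PyYAML is installed and that default_config.yaml is in the working directory; it is not part of the repository's code):

```python
# Minimal sanity check: confirm AMP/FP16 is disabled in default_config.yaml.
# Assumes PyYAML is installed and the file is in the current directory.
import yaml

with open("default_config.yaml") as f:
    cfg = yaml.safe_load(f)

print("mixed_precision:", cfg.get("mixed_precision"))
assert cfg.get("mixed_precision") == "no", "AMP/FP16 is still enabled"
```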

@Zerycii (Author) commented Nov 22, 2024

> We've recently been informed of the same problem. In most cases, it is caused by training with AMP in FP16. Could you please change the following line https://github.com/nhduong/guided_demoireing_net/blob/3eb8fb1c3dd45f4a9bf4ca38acbce91b48135672/default_config.yaml#L7C1-L7C16 to `mixed_precision: 'no'` and give it another try? We are sorry for the inconvenience.

Thanks for your reply, but I didn't use accelerate or the config.yaml file. This is the command I ran:
CUDA_VISIBLE_DEVICES="0" python /home/lmh/lzq/guided_demoireing_net/main.py --affine --l1loss --adaloss --perloss --evaluate --log2file --data_path "/data1/lzq/Moire_512" --data_name Moire --train_dir "train" --test_dir "val" --moire_dir "moire" --clean_dir "clear" --batch_size 2 --T_0 50 --epochs 100 --init_weights

Do I have to use it?

@nhduong (Owner) commented Nov 22, 2024

Yes, it is recommended to use the Hugging Face Accelerator with our code. If you do not want to use it, please check your current Hugging Face Accelerator settings by following this tutorial: https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config. Please make sure that FP16 is not being used for your training.
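If you want to confirm at runtime which mode accelerate ended up in, a minimal sketch is below (it assumes a reasonably recent accelerate release that exposes the mixed_precision attribute on the Accelerator):

```python
# Sketch: print the active mixed-precision mode right after creating the
# Accelerator; it should report "no" when full FP32 training is configured.
from accelerate import Accelerator

accelerator = Accelerator()
print("mixed precision mode:", accelerator.mixed_precision)
```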

@Zerycii (Author) commented Nov 22, 2024

I used it, but there is an error, even though I have already installed torchvision.
```
Traceback (most recent call last):
  File "/home/lmh/lzq/guided_demoireing_net/main.py", line 19, in <module>
    import torchvision
ModuleNotFoundError: No module named 'torchvision'
Traceback (most recent call last):
  File "/home/lmh/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/bin/python', 'main.py', '--affine', '--l1loss', '--adaloss', '--perloss', '--dont_calc_mets_at_all', '--log2file', '--data_path', '/data1/lzq/Moire_512', '--data_name', 'aim', '--train_dir', 'train', '--test_dir', 'val', '--moire_dir', 'moire', '--clean_dir', 'clear', '--batch_size', '2', '--T_0', '50', '--epochs', '100', '--init_weights']' returned non-zero exit status 1.
```

@nhduong (Owner) commented Nov 22, 2024

I am not sure about this, but a clean installation of the Anaconda environment might help.
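As a side note, the traceback above shows accelerate (installed under /home/lmh/.local) launching /opt/anaconda3/bin/python, so torchvision may simply not be installed for the interpreter being launched. Before rebuilding the whole environment, a minimal check run with that same interpreter could be (illustrative, not part of the repository):

```python
# Sketch: report which interpreter is running and whether torchvision is
# importable from it, to rule out a mixed-environment problem.
import importlib.util
import sys

print("interpreter:", sys.executable)
print("torchvision importable:", importlib.util.find_spec("torchvision") is not None)
```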

@Zerycii (Author) commented Nov 23, 2024

I changed the accelerate version to 0.18.0.
(screenshot attached)
I used this command to run the code, but the SSIM and PSNR are 0:
CUDA_VISIBLE_DEVICES="0,1,2" nohup accelerate launch --config_file default_config.yaml main.py --affine --l1loss --adaloss --perloss --evaluate --log2file --data_path "/data1/lzq/Moire_512" --data_name aim --train_dir "train" --test_dir "val" --moire_dir "moire" --clean_dir "clear" --batch_size 2 --T_0 50 --epochs 100 --init_weights
I don't know why they are zero.

@nhduong (Owner) commented Nov 23, 2024

Is it related to this problem?

@Zerycii (Author) commented Nov 24, 2024

> Is it related to this problem?

Here are the args:
Namespace(data_path='/data1/lzq/Moire_512', train_dir='train', test_dir='val', moire_dir='moire', clean_dir='clear', data_name='aim', exp_name='spl', note='rev_1', adaloss=True, affine=True, l1loss=True, perloss=True, workers=4, epochs=100, start_epoch=0, batch_size=2, test_batch_size=1, lr=0.0002, eta_min=1e-06, ada_lamb=5.0, ada_eps=1.0, ada_eps_2=1.0, num_branches=3, init_weights=True, T_0=50, print_freq=1000, resume='', evaluate=True, calc_mets=False, calc_val_losses=False, calc_train_mets=False, dont_calc_mets_at_all=False, dont_calc_train_mets=False, log2file=True, seed=123)
You can see that I set evaluate=True.

@nhduong (Owner) commented Nov 24, 2024

I meant the problem with the ski_ssim() function in the above post.

@Zerycii (Author) commented Nov 25, 2024

> I meant the problem with the ski_ssim() function in the above post.

Thanks for your help. It seems the ski_ssim() function was indeed the problem; I replaced it and got it working!
But at the 50th epoch the loss became NaN again. Here is my config file, is there any problem with it?
(screenshot attached)
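For anyone hitting the same zero-SSIM/PSNR issue, one possible replacement for the metric computation, sketched with scikit-image (the function name and signature here are illustrative, not the repository's actual ski_ssim()):

```python
# Sketch: compute PSNR/SSIM with scikit-image on HxWx3 uint8 arrays in [0, 255].
# Requires scikit-image >= 0.19 for the channel_axis argument.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compute_metrics(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, data_range=255, channel_axis=-1)
    return psnr, ssim
```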

@nhduong (Owner) commented Nov 25, 2024

It looks fine to me. Could you please identify which loss function causes the error?
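One way to narrow it down is to check each term separately; the sketch below uses placeholder names for the terms enabled by --l1loss, --adaloss, and --perloss, which may differ from the actual variables in main.py:

```python
# Sketch: raise as soon as any individual loss term becomes NaN/Inf, so the
# first offending term and the exact epoch/step are reported.
import torch

def check_losses(epoch: int, step: int, **losses: torch.Tensor) -> None:
    for name, value in losses.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f"{name} loss became non-finite at epoch {epoch}, step {step}")

# Example usage inside the training loop (placeholder variable names):
# check_losses(epoch, step, l1=l1_loss, ada=ada_loss, per=per_loss)
```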

@Zerycii (Author) commented Nov 25, 2024

> It looks fine to me. Could you please identify which loss function causes the error?

I think all the loss functions are NaN.
(screenshot attached)

@nhduong (Owner) commented Nov 25, 2024

Sorry about the inconvenience. To be honest, we have no clue why this happens. Could you please follow this post to get an error traceback? Thank you.
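For reference, a common way to get such a traceback is PyTorch's built-in anomaly detection (a general debugging technique, not necessarily what the linked post describes):

```python
# Sketch: enable anomaly detection so that the backward pass raises an error
# whose traceback points at the forward-pass operation that produced NaN/Inf.
# It slows training noticeably, so enable it only while debugging.
import torch

torch.autograd.set_detect_anomaly(True)

# ... set up the model, losses, and optimizer as usual; loss.backward() will
# now fail with a traceback naming the offending operation.
```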

@Zerycii (Author) commented Nov 25, 2024

> Sorry about the inconvenience. To be honest, we have no clue why this happens. Could you please follow this post to get an error traceback? Thank you.

OK, thanks for your reply. I will try it.

@nhduong (Owner) commented Nov 28, 2024

Have you solved the problem? Based on recent comments we have received, the loss can become NaN when FP16 is used. Could you please double-check the data types again by following this post? Thank you.
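A minimal sketch of such a check is below; `model` and `batch` are placeholder names for the objects in main.py:

```python
# Sketch: confirm that neither the model parameters nor the input batch have
# been silently cast to float16.
import torch

def report_dtypes(model: torch.nn.Module, batch: torch.Tensor) -> None:
    param_dtypes = {p.dtype for p in model.parameters()}
    print("parameter dtypes:", param_dtypes)
    print("input batch dtype:", batch.dtype)
    assert torch.float16 not in param_dtypes and batch.dtype != torch.float16

# Usage inside the training loop (placeholder names):
# report_dtypes(model, moire_batch)
```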

@Zerycii (Author) commented Nov 28, 2024

> Have you solved the problem? Based on recent comments we have received, the loss can become NaN when FP16 is used. Could you please double-check the data types again by following this post? Thank you.

I printed the types and all of them are float32. Unfortunately, I have not solved this problem.
(screenshot attached)

@nhduong closed this as not planned (won't fix / can't repro / duplicate / stale) on Dec 7, 2024