
loss=nan #5

Closed · Zerycii opened this issue Nov 22, 2024 · 17 comments
@Zerycii commented Nov 22, 2024


Thanks for your excellent work, but it seems I have run into a problem: I don't know why the loss is NaN. Adding normalization did not help. I would sincerely appreciate your help!

@Zerycii (Author) commented Nov 22, 2024

This is the screenshot (image attachment: f49804c6b3eaf247aaec13e81a3525c).

@nhduong (Owner) commented Nov 22, 2024

We've recently been informed of the same problem. In most cases, it is caused by training with AMP in FP16. Could you please change the following line
https://github.com/nhduong/guided_demoireing_net/blob/3eb8fb1c3dd45f4a9bf4ca38acbce91b48135672/default_config.yaml#L7C1-L7C16
to `mixed_precision: 'no'` and give it another try? We are sorry for the inconvenience.
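After editing the file, a quick way to confirm the change took effect is sketched below (this assumes PyYAML is installed and that default_config.yaml is in the working directory; it is not part of the repository's code):

```python
# Minimal sanity check: confirm AMP/FP16 is disabled in default_config.yaml.
# Assumes PyYAML is installed and the file is in the current directory.
import yaml

with open("default_config.yaml") as f:
    cfg = yaml.safe_load(f)

print("mixed_precision:", cfg.get("mixed_precision"))
assert cfg.get("mixed_precision") == "no", "AMP/FP16 is still enabled"
```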

@Zerycii (Author) commented Nov 22, 2024

> We've recently been informed of the same problem. In most cases, it is caused by training with AMP in FP16. Could you please change the following line https://github.com/nhduong/guided_demoireing_net/blob/3eb8fb1c3dd45f4a9bf4ca38acbce91b48135672/default_config.yaml#L7C1-L7C16 to `mixed_precision: 'no'` and give it another try? We are sorry for the inconvenience.

Thanks for your reply, but I didn't use accelerate or the config.yaml file. This is the command I ran:
CUDA_VISIBLE_DEVICES="0" python /home/lmh/lzq/guided_demoireing_net/main.py --affine --l1loss --adaloss --perloss --evaluate --log2file --data_path "/data1/lzq/Moire_512" --data_name Moire --train_dir "train" --test_dir "val" --moire_dir "moire" --clean_dir "clear" --batch_size 2 --T_0 50 --epochs 100 --init_weights

Do I have to use it?

@nhduong (Owner) commented Nov 22, 2024

Yes, it is recommended to use the Hugging Face Accelerator with our code. If you do not want to use it, please check your current Hugging Face Accelerator settings by following this tutorial: https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config. Please make sure that FP16 is not being used for your training.
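If you want to confirm at runtime which mode accelerate ended up in, a minimal sketch is below (it assumes a reasonably recent accelerate release that exposes the mixed_precision attribute on the Accelerator):

```python
# Sketch: print the active mixed-precision mode right after creating the
# Accelerator; it should report "no" when full FP32 training is configured.
from accelerate import Accelerator

accelerator = Accelerator()
print("mixed precision mode:", accelerator.mixed_precision)
```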

@Zerycii (Author) commented Nov 22, 2024

I used it, but there is an error, even though I have already installed torchvision.
```
Traceback (most recent call last):
  File "/home/lmh/lzq/guided_demoireing_net/main.py", line 19, in <module>
    import torchvision
ModuleNotFoundError: No module named 'torchvision'
Traceback (most recent call last):
  File "/home/lmh/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/bin/python', 'main.py', '--affine', '--l1loss', '--adaloss', '--perloss', '--dont_calc_mets_at_all', '--log2file', '--data_path', '/data1/lzq/Moire_512', '--data_name', 'aim', '--train_dir', 'train', '--test_dir', 'val', '--moire_dir', 'moire', '--clean_dir', 'clear', '--batch_size', '2', '--T_0', '50', '--epochs', '100', '--init_weights']' returned non-zero exit status 1.
```

@nhduong (Owner) commented Nov 22, 2024

I am not sure about this, but a clean installation of the Anaconda environment might help.
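As a side note, the traceback above shows accelerate (installed under /home/lmh/.local) launching /opt/anaconda3/bin/python, so torchvision may simply not be installed for the interpreter being launched. Before rebuilding the whole environment, a minimal check run with that same interpreter could be (illustrative, not part of the repository):

```python
# Sketch: report which interpreter is running and whether torchvision is
# importable from it, to rule out a mixed-environment problem.
import importlib.util
import sys

print("interpreter:", sys.executable)
print("torchvision importable:", importlib.util.find_spec("torchvision") is not None)
```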

@Zerycii (Author) commented Nov 23, 2024

I changed the accelerate version to 0.18.0.
(screenshot attached)
I used this command to run the code, but the SSIM and PSNR are 0:
CUDA_VISIBLE_DEVICES="0,1,2" nohup accelerate launch --config_file default_config.yaml main.py --affine --l1loss --adaloss --perloss --evaluate --log2file --data_path "/data1/lzq/Moire_512" --data_name aim --train_dir "train" --test_dir "val" --moire_dir "moire" --clean_dir "clear" --batch_size 2 --T_0 50 --epochs 100 --init_weights
I don't know why they are zero.

@nhduong (Owner) commented Nov 23, 2024

Is it related to this problem?

@Zerycii (Author) commented Nov 24, 2024

> Is it related to this problem?

Here are the args:
Namespace(data_path='/data1/lzq/Moire_512', train_dir='train', test_dir='val', moire_dir='moire', clean_dir='clear', data_name='aim', exp_name='spl', note='rev_1', adaloss=True, affine=True, l1loss=True, perloss=True, workers=4, epochs=100, start_epoch=0, batch_size=2, test_batch_size=1, lr=0.0002, eta_min=1e-06, ada_lamb=5.0, ada_eps=1.0, ada_eps_2=1.0, num_branches=3, init_weights=True, T_0=50, print_freq=1000, resume='', evaluate=True, calc_mets=False, calc_val_losses=False, calc_train_mets=False, dont_calc_mets_at_all=False, dont_calc_train_mets=False, log2file=True, seed=123)
You can see that I set evaluate=True.

@nhduong (Owner) commented Nov 24, 2024

I meant the problem with the ski_ssim() function in the above post.

@Zerycii (Author) commented Nov 25, 2024

> I meant the problem with the ski_ssim() function in the above post.

Thanks for your help. It seems the ski_ssim() function was indeed the problem; I replaced it and got it working!
But at the 50th epoch the loss became NaN again. Here is my config file, is there any problem with it?
(screenshot attached)
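For anyone hitting the same zero-SSIM/PSNR issue, one possible replacement for the metric computation, sketched with scikit-image (the function name and signature here are illustrative, not the repository's actual ski_ssim()):

```python
# Sketch: compute PSNR/SSIM with scikit-image on HxWx3 uint8 arrays in [0, 255].
# Requires scikit-image >= 0.19 for the channel_axis argument.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compute_metrics(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, data_range=255, channel_axis=-1)
    return psnr, ssim
```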

@nhduong (Owner) commented Nov 25, 2024

It looks fine to me. Could you please identify which loss function causes the error?
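One way to narrow it down is to check each term separately; the sketch below uses placeholder names for the terms enabled by --l1loss, --adaloss, and --perloss, which may differ from the actual variables in main.py:

```python
# Sketch: raise as soon as any individual loss term becomes NaN/Inf, so the
# first offending term and the exact epoch/step are reported.
import torch

def check_losses(epoch: int, step: int, **losses: torch.Tensor) -> None:
    for name, value in losses.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(f"{name} loss became non-finite at epoch {epoch}, step {step}")

# Example usage inside the training loop (placeholder variable names):
# check_losses(epoch, step, l1=l1_loss, ada=ada_loss, per=per_loss)
```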

@Zerycii (Author) commented Nov 25, 2024

> It looks fine to me. Could you please identify which loss function causes the error?

I think all the loss functions are NaN.
(screenshot attached)

@nhduong (Owner) commented Nov 25, 2024

Sorry about the inconvenience. To be honest, we have no clue why this happens. Could you please follow this post to get an error traceback? Thank you.
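For reference, a common way to get such a traceback is PyTorch's built-in anomaly detection (a general debugging technique, not necessarily what the linked post describes):

```python
# Sketch: enable anomaly detection so that the backward pass raises an error
# whose traceback points at the forward-pass operation that produced NaN/Inf.
# It slows training noticeably, so enable it only while debugging.
import torch

torch.autograd.set_detect_anomaly(True)

# ... set up the model, losses, and optimizer as usual; loss.backward() will
# now fail with a traceback naming the offending operation.
```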

@Zerycii (Author) commented Nov 25, 2024

> Sorry about the inconvenience. To be honest, we have no clue why this happens. Could you please follow this post to get an error traceback? Thank you.

OK, thanks for your reply. I will try it.

@nhduong (Owner) commented Nov 28, 2024

Have you solved the problem? Based on recent comments we have received, the loss can become NaN when FP16 is used. Could you please double-check the data types again by following this post? Thank you.
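A minimal sketch of such a check is below; `model` and `batch` are placeholder names for the objects in main.py:

```python
# Sketch: confirm that neither the model parameters nor the input batch have
# been silently cast to float16.
import torch

def report_dtypes(model: torch.nn.Module, batch: torch.Tensor) -> None:
    param_dtypes = {p.dtype for p in model.parameters()}
    print("parameter dtypes:", param_dtypes)
    print("input batch dtype:", batch.dtype)
    assert torch.float16 not in param_dtypes and batch.dtype != torch.float16

# Usage inside the training loop (placeholder names):
# report_dtypes(model, moire_batch)
```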

@Zerycii (Author) commented Nov 28, 2024

> Have you solved the problem? Based on recent comments we have received, the loss can become NaN when FP16 is used. Could you please double-check the data types again by following this post? Thank you.

I printed the types and all of them are float32. Unfortunately, I have not solved this problem.
(screenshot attached)

@nhduong closed this as not planned (won't fix / can't repro / duplicate / stale) on Dec 7, 2024