Can't install requirements.txt #3

Closed
Zerycii opened this issue Oct 16, 2024 · 8 comments

Comments

@Zerycii

Zerycii commented Oct 16, 2024

image
I always get this error, even after changing the channel.

@nhduong
Owner

nhduong commented Oct 16, 2024

Sorry for the inconvenience. You might want to consider one of the following approaches:

  • Creating a basic environment and installing additional packages based on the errors you get.
  • Running the code with Docker. First pull a Docker image for PyTorch, e.g., pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel, then install the additional packages inside it.

@Zerycii
Author

Zerycii commented Oct 20, 2024

Thanks for your reply and outstanding work! I tried installing the packages one by one, but I don't know how to install cuml and apex. I tried many approaches without success. Could you help me? Also, if I don't install cuml and apex, will that cause an error?

@nhduong
Owner

nhduong commented Oct 20, 2024

Thank you for your interest in our work.

We used apex to speed up training with automatic mixed precision, so the code should still run normally without it. All you need to do is change mixed_precision: fp16 to mixed_precision: no in default_config.yaml.
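The config change described above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name set_mixed_precision and the sample config text are hypothetical, and it edits the text line by line rather than parsing YAML so the rest of the file stays untouched. Some YAML parsers read a bare no as a boolean, so if the setting does not take effect, passing "'no'" (quoted) as the value may be worth trying; that is an assumption to verify against your setup.

```python
import re

# Matches a `mixed_precision: <value>` line, keeping indentation and the key.
_MIXED_PRECISION = re.compile(
    r"^([ \t]*mixed_precision[ \t]*:[ \t]*)\S.*$", flags=re.MULTILINE
)


def set_mixed_precision(config_text: str, value: str = "no") -> str:
    """Return config_text with the mixed_precision entry set to `value`.

    Hypothetical helper: only the matching line is rewritten.
    """
    updated, count = _MIXED_PRECISION.subn(
        lambda m: m.group(1) + value, config_text
    )
    if count == 0:
        raise ValueError("no mixed_precision key found in the config text")
    return updated


# Illustrative config text; the real default_config.yaml may contain other keys.
example = "compute_environment: LOCAL_MACHINE\nmixed_precision: fp16\nnum_processes: 1\n"
print(set_mixed_precision(example))
```

To apply it to a file, read default_config.yaml into a string, pass it through the helper, and write the result back after keeping a backup copy.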

If the above options do not help, you might want to take a look at NVIDIA's PyTorch containers, e.g., https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-05.html, which also include apex.

nhduong closed this as completed Oct 23, 2024
@Zerycii
Author

Zerycii commented Oct 24, 2024

/home/lmh/.local/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

Tensorboard logs saved to: /data1/lzq/results_guided/outputs/spl/Moire/[spl][Moire][rev_1][2024-10-24][23.11.26.736310][GPU_1][amax]/tb/train
Tensorboard logs saved to: /data1/lzq/results_guided/outputs/spl/Moire/[spl][Moire][rev_1][2024-10-24][23.11.26.736310][GPU_1][amax]/tb/val
Traceback (most recent call last):
  File "/home/lmh/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/bin/python', 'main.py', '--affine', '--l1loss', '--adaloss', '--perloss', '--dont_calc_mets_at_all', '--log2file', '--data_path', '/data1/lzq/Moire_512', '--data_name', 'Moire', '--train_dir', 'train', '--test_dir', 'val', '--moire_dir', 'moire', '--clean_dir', 'clear', '--batch_size', '2', '--T_0', '50', '--epochs', '100', '--init_weights']' returned non-zero exit status 1.
Excuse me, I always get this error. My Python version is 3.8.13, and my torch/torchvision versions are shown below. I have reinstalled torch and torchvision many times, but the error persists.

image
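The paths in the log above mention python3.10 under /home/lmh/.local and /opt/anaconda3/bin/python, while the comment reports Python 3.8.13, so more than one interpreter may be involved. A small check like the one below, run with the same interpreter that launches main.py, shows which Python and which torch/torchvision builds are actually in use. This is only a diagnostic sketch; the function name describe_environment is hypothetical, and it reads installed package metadata without importing torch.

```python
import sys
from importlib import metadata


def describe_environment(packages=("torch", "torchvision")):
    """Collect the interpreter path, Python version, and package versions.

    Hypothetical helper: a package that is not installed for this
    interpreter is reported as None.
    """
    info = {"executable": sys.executable, "python": sys.version.split()[0]}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the reported executable or versions differ from what the screenshot shows, the packages were probably installed into a different environment than the one running the training script.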

@nhduong
Owner

nhduong commented Oct 25, 2024

Since --log2file is set, could you please take a look at the .log file in the output folder for more information?
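One way to locate and skim that log is sketched below. It assumes only that the log has a .log extension somewhere under the run's output folder, as stated above; the helper names find_log_files and tail are hypothetical. The search uses pathlib so that the square brackets in the run folder name are treated as literal characters rather than glob patterns.

```python
import sys
from pathlib import Path


def find_log_files(output_dir):
    """Return all .log files under output_dir, newest first."""
    root = Path(output_dir)
    logs = [p for p in root.rglob("*.log") if p.is_file()]
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


def tail(path, n=40):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as fh:
        return fh.read().splitlines()[-n:]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_logs.py <output folder of the run>
    found = find_log_files(sys.argv[1])
    if not found:
        print("no .log files found")
    else:
        print(f"newest log: {found[0]}")
        print("\n".join(tail(found[0])))
```

The end of the newest log usually contains the Python exception that caused the non-zero exit status reported by accelerate.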

@Zerycii
Author

Zerycii commented Oct 28, 2024

Thanks for your help. I used the following command to train the model:

python main.py --affine --l1loss --adaloss --perloss --dont_calc_mets_at_all --log2file \
    --data_path "path_to/aim2019_demoireing_track1" \
    --data_name aim --train_dir "train" --test_dir "val" --moire_dir "moire" --clean_dir "clear" \
    --batch_size 2 --T_0 50 --epochs 200 --init_weights

But there is another problem:

image
When training finished, I found that PSNR = 0 and SSIM = 0.

@nhduong
Owner

nhduong commented Oct 28, 2024

That is because --dont_calc_mets_at_all was used to skip the evaluation steps (which rely heavily on the CPU) to speed up training. You can follow the evaluation step in the instructions to get PSNR and SSIM.

@Zerycii
Author

Zerycii commented Oct 28, 2024

Thanks very much. I'm grateful for your kind help and your outstanding work!
