Can't install requirements.txt #3

Zerycii · 2024-10-16T07:27:12Z

I always get this error, even after changing the channel.

nhduong · 2024-10-16T07:43:41Z

Sorry for the inconvenience. You might want to consider one of the following approaches:

Creating a basic environment and installing additional packages based on the errors you get.
Running the code with Docker. First, you pull a Docker image for PyTorch, e.g., pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel, then you can install additional packages.
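For the first approach, a small check like the one below can show which packages are still missing before you launch training. This is only a sketch: the module list is a hypothetical example and should be adjusted to match requirements.txt.

```python
import importlib.util


def find_missing(modules):
    """Return the names from `modules` that cannot be imported in this environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Hypothetical list of import names; adjust it to match requirements.txt.
CANDIDATES = ["torch", "torchvision", "accelerate", "cuml", "apex"]

if __name__ == "__main__":
    for name in find_missing(CANDIDATES):
        print(f"missing: {name}")
```

Run it with the same interpreter that runs main.py, since a package installed for a different Python will still be reported as missing.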

Zerycii · 2024-10-20T08:30:57Z

Thanks for your reply and outstanding work! I tried to install the packages one by one, but I don't know how to install cuml and apex. I tried many ways without success. Could you help me? And if I don't install cuml and apex, will there be an error?

nhduong · 2024-10-20T08:49:03Z

Thank you for your interest in our work.

We used apex to speed up training with automatic mixed precision, so the code is still supposed to run normally without it. All you need to do is change mixed_precision: fp16 to mixed_precision: no in default_config.yaml.
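If you prefer to script the config change, the following is a minimal sketch. The helper names are hypothetical, and the path to default_config.yaml is left as a parameter because its location depends on your setup.

```python
import re
from pathlib import Path


def disable_mixed_precision(yaml_text):
    """Rewrite the `mixed_precision` line of a config to `no`, leaving the rest as-is."""
    # Plain text substitution, so no YAML library is needed and
    # comments and key order in the file are preserved.
    return re.sub(
        r"^([ \t]*mixed_precision:[ \t]*).*$",
        r"\g<1>no",
        yaml_text,
        flags=re.MULTILINE,
    )


def patch_config(path):
    """Apply the rewrite in place to the config file at `path`."""
    config = Path(path)
    config.write_text(disable_mixed_precision(config.read_text()))
```

Calling patch_config on your default_config.yaml changes only the mixed_precision line. Editing the file by hand works just as well.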

If the above options do not help, you might want to take a look at the NVIDIA PyTorch containers, e.g., https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-05.html, which also include apex.

Zerycii · 2024-10-24T15:20:47Z

/home/lmh/.local/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

Tensorboard logs saved to: /data1/lzq/results_guided/outputs/spl/Moire/[spl][Moire][rev_1][2024-10-24][23.11.26.736310][GPU_1][amax]/tb/train
Tensorboard logs saved to: /data1/lzq/results_guided/outputs/spl/Moire/[spl][Moire][rev_1][2024-10-24][23.11.26.736310][GPU_1][amax]/tb/val
Traceback (most recent call last):
File "/home/lmh/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
simple_launcher(args)
File "/home/lmh/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/anaconda3/bin/python', 'main.py', '--affine', '--l1loss', '--adaloss', '--perloss', '--dont_calc_mets_at_all', '--log2file', '--data_path', '/data1/lzq/Moire_512', '--data_name', 'Moire', '--train_dir', 'train', '--test_dir', 'val', '--moire_dir', 'moire', '--clean_dir', 'clear', '--batch_size', '2', '--T_0', '50', '--epochs', '100', '--init_weights']' returned non-zero exit status 1.
Excuse me, I always get this error. My Python is 3.8.13, and my torch/torchvision versions are as follows. I have reinstalled torch and torchvision many times, but the error is still there.

nhduong · 2024-10-25T01:45:41Z

Since --log2file is set, could you please take a look at the .log file in the output folder for more information?
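To locate that file quickly, something like the sketch below may help. It assumes the log has a .log extension somewhere under the output folder; the function name is hypothetical.

```python
from pathlib import Path


def tail_of_latest_log(output_dir, n_lines=40):
    """Return the last `n_lines` lines of the newest .log file under `output_dir`."""
    # Search recursively and order the matches by modification time
    logs = sorted(Path(output_dir).rglob("*.log"), key=lambda p: p.stat().st_mtime)
    if not logs:
        return None  # no log file found
    lines = logs[-1].read_text(errors="replace").splitlines()
    return "\n".join(lines[-n_lines:])
```

Pass the output folder printed in the "Tensorboard logs saved to" lines (without the trailing tb/train part) and print the result. The actual Python error is usually near the end of the log.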

Zerycii · 2024-10-28T10:06:14Z

Thanks for your help. I used the following command to train the model:

python main.py --affine --l1loss --adaloss --perloss --dont_calc_mets_at_all --log2file --data_path "path_to/aim2019_demoireing_track1" --data_name aim --train_dir "train" --test_dir "val" --moire_dir "moire" --clean_dir "clear" --batch_size 2 --T_0 50 --epochs 200 --init_weights

But there is another question: when training finished, I found that PSNR = 0 and SSIM = 0.

nhduong · 2024-10-28T10:17:30Z

That is because --dont_calc_mets_at_all was used to skip evaluation steps (which heavily rely on CPU) to speed up the training. You can follow the evaluation step in the instructions to get PSNR and SSIM.
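For a quick sanity check on a single image pair, PSNR can also be computed by hand. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel values. It is not the repository's evaluation code, and the function name and 8-bit default are assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    if len(reference) == 0 or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For reported numbers, use the evaluation step from the instructions so that the results stay comparable.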

Zerycii · 2024-10-28T13:18:13Z

Thanks very much. I'm grateful for your kind help and your outstanding work!

nhduong closed this as completed Oct 23, 2024
