diffusion-nodes can fail loading the checkpoint on some rank #281

Open
Delaunay opened this issue Sep 11, 2024 · 6 comments

Comments

@Delaunay
Collaborator

Delaunay commented Sep 11, 2024

diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]: Traceback (most recent call last):
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 105, in load_state_dict
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     return safetensors.torch.load_file(checkpoint_file, device="cpu")
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     with safe_open(filename, framework="pt", device=device) as f:
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]: FileNotFoundError: No such file or directory: "/tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/snapshots/1e128c8891e52218b74cde8f26dbfc701cb99d79/vae/diffusion_pytorch_model.safetensors"
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] 
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]: During handling of the above exception, another exception occurred:
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] 
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]: Traceback (most recent call last):
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 249, in <module>
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     main()
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 243, in main
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     train(observer, config)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 159, in train
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     encoder, vae, unet = models(accelerator, args)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 53, in models
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     vae = AutoencoderKL.from_pretrained(
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     return fn(*args, **kwargs)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 735, in from_pretrained
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     state_dict = load_state_dict(model_file, variant=variant)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 115, in load_state_dict
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]:     with open(checkpoint_file) as f:
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank14]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/snapshots/1e128c8891e52218b74cde8f26dbfc701cb99d79/vae/diffusion_pytorch_model.safetensors'

Note that the file does exist and was rsynced to the local node beforehand.
Additionally, the error only appears on rank14; if the file were missing, we would expect all of rank[8-15] to print an error.

@Delaunay
Collaborator Author

Delaunay commented Sep 11, 2024

Another run

diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]: Traceback (most recent call last):
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 105, in load_state_dict
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     return safetensors.torch.load_file(checkpoint_file, device="cpu")
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/safetensors/torch.py", line 313, in load_file
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     with safe_open(filename, framework="pt", device=device) as f:
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]: FileNotFoundError: No such file or directory: "/tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/snapshots/1e128c8891e52218b74cde8f26dbfc701cb99d79/unet/diffusion_pytorch_model.safetensors"
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] 
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]: During handling of the above exception, another exception occurred:
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] 
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]: Traceback (most recent call last):
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 249, in <module>
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     main()
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 243, in main
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     train(observer, config)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 159, in train
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     encoder, vae, unet = models(accelerator, args)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/home/mila/d/delaunap/milabench/benchmarks/diffusion/main.py", line 57, in models
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     unet = UNet2DConditionModel.from_pretrained(
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     return fn(*args, **kwargs)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 735, in from_pretrained
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     state_dict = load_state_dict(model_file, variant=variant)
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:   File "/tmp/workspace/cuda/results/venv/torch/lib/python3.10/site-packages/diffusers/models/model_loading_utils.py", line 115, in load_state_dict
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]:     with open(checkpoint_file) as f:
diffusion-nodes.cn-n001.server.mila.quebec.nolog [stderr] [rank11]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/snapshots/1e128c8891e52218b74cde8f26dbfc701cb99d79/unet/diffusion_pytorch_model.safetensors'
delaunap@cn-n001:/tmp/workspace/cuda/results$ readlink -f /tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/snapshots/1e128c8891e52218b74cde8f26dbfc701cb99d79/unet/diffusion_pytorch_model.safetensors
/tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/blobs/df895df60d360b4b18e9c05485080e3dd35691452460c7234031b64465120d5a
delaunap@cn-n001:/tmp/workspace/cuda/results$ ls -all /tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/blobs/df895df60d360b4b18e9c05485080e3dd35691452460c7234031b64465120d5a
-rw-rw-r-- 1 delaunap delaunap 3463726498 Sep 11 10:56 /tmp/workspace/cuda/results/cache/huggingface/hub/models--stabilityai--stable-diffusion-2/blobs/df895df60d360b4b18e9c05485080e3dd35691452460c7234031b64465120d5a
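
The `readlink -f` / `ls` check above only reflects what the login shell sees after the fact. A rough sketch of the same check in Python, which could be logged from inside the failing rank just before the load, might look like the following. The helper name `describe_path` is hypothetical and is not part of milabench or diffusers.

```python
import os


def describe_path(path):
    """Report whether `path` is a symlink, where it resolves to, and the target size.

    Hypothetical diagnostic helper: it mirrors `readlink -f` followed by `ls -l`.
    """
    target = os.path.realpath(path)      # follow every symlink, like `readlink -f`
    exists = os.path.exists(target)      # False if the resolved blob is missing
    return {
        "is_symlink": os.path.islink(path),
        "target": target,
        "exists": exists,
        "size": os.path.getsize(target) if exists else None,
    }
```

Printing this from each rank for the snapshot path would show whether the symlink or the blob is the part that is not visible at the moment of failure.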

@Delaunay
Collaborator Author

Delaunay commented Sep 11, 2024

Note that the failing file differs between the two errors:

  • vae/diffusion_pytorch_model.safetensors
  • unet/diffusion_pytorch_model.safetensors

Note that unet is loaded AFTER vae, so the error seems to follow the load order.

@Delaunay
Collaborator Author

The files are all local to the machine, on /tmp.

@Delaunay
Collaborator Author

Note that it works after a few retries, so the issue seems to be timing related.
Filelock?

This issue seems to have appeared on the H100 nodes; I don't recall seeing it on A100.
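
Since retrying eventually succeeds, one possible stopgap (not a fix for the root cause) would be to retry the load inside the benchmark itself. A minimal sketch, assuming the transient failure always surfaces as `FileNotFoundError` as in the tracebacks above; `load_with_retry` and its parameters are hypothetical names, not existing milabench code:

```python
import time


def load_with_retry(load_fn, retries=5, delay=0.5, sleep=time.sleep):
    """Call load_fn(), retrying on FileNotFoundError with a linearly growing wait.

    Hypothetical helper: `retries` is the total number of attempts, and the
    last failure is re-raised so a genuinely missing file is still reported.
    """
    for attempt in range(1, retries + 1):
        try:
            return load_fn()
        except FileNotFoundError:
            if attempt == retries:
                raise
            sleep(delay * attempt)  # wait a bit longer after each failed attempt
```

The `from_pretrained` calls in `models()` could then be wrapped in a lambda and passed as `load_fn`. This would only mask the race, so it seems more useful for confirming the timing theory than as a permanent change.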

@bouthilx
Member

It's in /tmp so it would be an issue with the node's FS? 🤔 Does it make a difference if you swap which of the 2 nodes you use as the master?

@Delaunay
Collaborator Author

Olexa suggests it might be an open-file ulimit issue.
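
To check the ulimit theory, each rank could log its file-descriptor limit and current usage just before loading the checkpoint. A small sketch using only the standard library; `fd_usage` is a hypothetical helper and the `/proc/self/fd` count assumes Linux:

```python
import os
import resource


def fd_usage():
    """Return the soft/hard RLIMIT_NOFILE values and the number of open fds.

    Hypothetical diagnostic: the open count comes from /proc/self/fd, which
    only exists on Linux, so it is None elsewhere.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        open_fds = len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        open_fds = None
    return {"soft": soft, "hard": hard, "open": open_fds}
```

Worth keeping in mind that running out of descriptors normally surfaces as `OSError: [Errno 24] Too many open files` rather than errno 2, so this would mostly help confirm or rule out the hypothesis.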
