I'm training a model using the PyTorch Lightning plug-in and a limit on the number of kept models. The problem is that, when the limit defined in save_top_k is reached, PTL will at some point call lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:
fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")
but when it then tries to remove an already-removed checkpoint file (I'm using xser), it crashes:
File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
self._run(model, ckpt_path=ckpt_path)
As you can see, more than one process is trying to remove the same file. I think fixing this would just be a matter of running checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).

Here is relevant info about my environment:
pip freeze:
Neuron libraries:
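For context, a checkpointing setup that exercises this code path looks roughly like the sketch below. It is only an illustration: the actual callback settings, paths, and Neuron strategy/plugin configuration are not shown in this issue, so every value here is a placeholder.

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Placeholder values: keep only the 3 best checkpoints. Once a 4th one is
# saved, Lightning calls remove_checkpoint() on the checkpoint that falls
# out of the top-k, which is where the crash above happens.
checkpoint_cb = ModelCheckpoint(
    dirpath="/shared/checkpoints",  # hypothetical shared-filesystem path
    monitor="val_loss",
    save_top_k=3,
)

trainer = Trainer(
    max_epochs=10,
    callbacks=[checkpoint_cb],
    # ... plus the Neuron strategy / CheckpointIO plugin configuration
    # used on the 16-node TP=8 cluster (omitted here).
)
```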
Thanks for reporting the issue. We're looking at it. So far, it appears this comes from PTL's torch_io.remove_checkpoint(), but we'll check whether anything on the Neuron side can help.
We have identified the source of the issue. It's mainly coming from this API: neuronx_distributed's CheckpointIO class has not implemented the remove_checkpoint API, so it falls back to PyTorch Lightning's default implementation, which assumes DDP. You are right in the sense that all TP workers would start deleting the same file, and we need to rewrite the API to ensure that only one worker deletes at a time.
We are looking into it and should have a fix in one of the upcoming releases. To unblock yourself, you can override the API and ensure that only one rank (usually 0) deletes the directory while the others wait. Sample implementation below:
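As a rough illustration of that idea (not the referenced sample itself), a minimal sketch of such an override might look like the following. It assumes checkpoints live on a filesystem shared by all workers and that torch.distributed is initialized, and it subclasses Lightning's TorchCheckpointIO as a stand-in; in practice you would override remove_checkpoint on the Neuron plugin's own CheckpointIO class instead. The class name RankZeroRemovalCheckpointIO is made up.

```python
import torch.distributed as dist
from lightning.fabric.plugins.io import TorchCheckpointIO


class RankZeroRemovalCheckpointIO(TorchCheckpointIO):
    """Illustrative sketch: only global rank 0 deletes checkpoints,
    while the other ranks wait at a barrier."""

    def remove_checkpoint(self, path) -> None:
        if not dist.is_initialized():
            # Single-process run: behave exactly like the default plugin.
            super().remove_checkpoint(path)
            return
        if dist.get_rank() == 0:
            # Only one worker touches the shared filesystem.
            super().remove_checkpoint(path)
        # Hold every rank here until the deletion has finished, so no
        # worker races ahead and trips over a half-removed checkpoint.
        dist.barrier()
```

One way to wire this in, assuming a stock Lightning setup, is Trainer(plugins=[RankZeroRemovalCheckpointIO()]); with the Neuron plugin, the override would instead go on whichever CheckpointIO object the strategy is configured with.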