
Cleanup of old checkpoints is crashing #20

Open
evellasques opened this issue Apr 25, 2024 · 2 comments

@evellasques

I'm training a model using the PyTorch Lightning plugin with a limit on the number of kept checkpoints:

ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)

The problem is that once the limit defined in save_top_k is reached, PTL will (at some point) call lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:

fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")

But then it tries to remove an already removed checkpoint file (I'm saving with xser, so each checkpoint is a directory of tensor files) and crashes:

 File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    self._run(model, ckpt_path=ckpt_path)

As you can see, more than one process is trying to remove the same file. I think this is just a matter of running the checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
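
To make the race easier to see outside of Lightning, here is a minimal standalone sketch (file names, sizes, and process counts are made up, not taken from my run): several processes remove the same directory tree concurrently, the way the TP workers all end up calling fs.rm(path, recursive=True), and the losers hit the same FileNotFoundError as in the traceback above.

# Standalone illustration of the race (not the Lightning code path).
# One process typically wins the rmtree; the others fail on entries
# that are already gone, matching the FileNotFoundError above.
import multiprocessing as mp
import os
import shutil
import tempfile


def remove(path):
    shutil.rmtree(path)  # raises FileNotFoundError if another process got there first


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp(prefix="ckpt_")
    # fake checkpoint directory with a few tensor files
    for i in range(512):
        with open(os.path.join(ckpt_dir, f"tensor_{i}.pt"), "wb") as f:
            f.write(b"\x00" * 1024)

    # several workers all try to delete the same directory, as in my setup
    procs = [mp.Process(target=remove, args=(ckpt_dir,)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()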

Here is relevant info about my environment:

pip freeze:

neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0

Neuron libraries:

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]
@jyang-aws

Thanks for reporting the issue. We're looking into it. So far it appears this comes from PTL's torch_io.remove_checkpoint(), but we'll check whether anything on the Neuron side can help.

@aws-rhsoln
Contributor

We have identified the source of the issue. It is mainly coming from this API: neuronx_distributed's CheckpointIO class has not implemented the remove_checkpoint API, so it falls back to PyTorch Lightning's default implementation, which assumes DDP. You are right in the sense that all TP workers start deleting the same files, and we need to rewrite the API so that only one worker performs the deletion.
We are looking into it and will have a fix in one of the upcoming releases. To unblock yourself, you can override the API and ensure that only one rank (usually 0) deletes the directory while the others wait. Sample implementation below:

def remove_checkpoint(self, filepath):
    if xm.get_ordinal() == 0:
        ...  # call delete (only global rank 0 removes the checkpoint)
    # all ranks synchronize here so no one races ahead of the deletion
    xm.rendezvous('Deleting checkpoint')
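
For reference, here is a slightly fuller sketch of that workaround. The class name is hypothetical and the imports follow the standalone lightning 2.x package layout; in practice you would override remove_checkpoint on neuronx_distributed's CheckpointIO class rather than on a new subclass:

# Sketch of the workaround: only global rank 0 deletes, everyone else waits.
import torch_xla.core.xla_model as xm
from lightning.fabric.utilities.cloud_io import get_filesystem
from lightning.pytorch.plugins.io import TorchCheckpointIO


class RankZeroRemoveCheckpointIO(TorchCheckpointIO):  # hypothetical class name
    def remove_checkpoint(self, path):
        # Only the global rank-0 worker removes the checkpoint directory ...
        if xm.get_ordinal() == 0:
            fs = get_filesystem(path)
            if fs.exists(path):
                fs.rm(path, recursive=True)
        # ... while every other rank blocks here until the deletion has finished.
        xm.rendezvous("Deleting checkpoint")

The plugin can then be passed to the Trainer, e.g. Trainer(plugins=[RankZeroRemoveCheckpointIO()], ...).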

Feel free to submit a pull request if you believe this resolves the issue.
