
Cleanup of old checkpoints is crashing #20

Open
evellasques opened this issue Apr 25, 2024 · 2 comments

@evellasques

I'm training a model using the PyTorch Lightning plugin with a limit on the number of kept checkpoints:

ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)

The problem is that once the limit defined in save_top_k is reached, PTL will (at some point) call lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:

fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")

But then it tries to remove an already removed checkpoint file (I'm saving with xser, so each checkpoint is a directory of tensor files) and crashes:

 File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    self._run(model, ckpt_path=ckpt_path)

As you can see, more than one process is trying to remove the same file. I think this is just a matter of running the checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
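
To make the race easier to see outside of Lightning, here is a minimal standalone sketch (file names, sizes, and process counts are made up, not taken from my run): several processes remove the same directory tree concurrently, the way the TP workers all end up calling fs.rm(path, recursive=True), and the losers hit the same FileNotFoundError as in the traceback above.

# Standalone illustration of the race (not the Lightning code path).
# One process typically wins the rmtree; the others fail on entries
# that are already gone, matching the FileNotFoundError above.
import multiprocessing as mp
import os
import shutil
import tempfile


def remove(path):
    shutil.rmtree(path)  # raises FileNotFoundError if another process got there first


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp(prefix="ckpt_")
    # fake checkpoint directory with a few tensor files
    for i in range(512):
        with open(os.path.join(ckpt_dir, f"tensor_{i}.pt"), "wb") as f:
            f.write(b"\x00" * 1024)

    # several workers all try to delete the same directory, as in my setup
    procs = [mp.Process(target=remove, args=(ckpt_dir,)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()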

Here is relevant info about my environment:

pip freeze:

neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0

Neuron libraries:

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]
@jyang-aws

Thanks for reporting the issue. We're looking into it. So far it appears this comes from PTL's torch_io.remove_checkpoint(), but we'll check whether anything on the Neuron side can help.

@aws-rhsoln
Contributor

We have identified the source of the issue. It is mainly coming from this API: neuronx_distributed's CheckpointIO class has not implemented the remove_checkpoint API, so it falls back to PyTorch Lightning's default implementation, which assumes DDP. You are right in the sense that all TP workers start deleting the same files, and we need to rewrite the API so that only one worker performs the deletion.
We are looking into it and will have a fix in one of the upcoming releases. To unblock yourself, you can override the API and ensure that only one rank (usually 0) deletes the directory while the others wait. Sample implementation below:

def remove_checkpoint(self, filepath):
    if xm.get_ordinal() == 0:
        ...  # call delete (only global rank 0 removes the checkpoint)
    # all ranks synchronize here so no one races ahead of the deletion
    xm.rendezvous('Deleting checkpoint')
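
For reference, here is a slightly fuller sketch of that workaround. The class name is hypothetical and the imports follow the standalone lightning 2.x package layout; in practice you would override remove_checkpoint on neuronx_distributed's CheckpointIO class rather than on a new subclass:

# Sketch of the workaround: only global rank 0 deletes, everyone else waits.
import torch_xla.core.xla_model as xm
from lightning.fabric.utilities.cloud_io import get_filesystem
from lightning.pytorch.plugins.io import TorchCheckpointIO


class RankZeroRemoveCheckpointIO(TorchCheckpointIO):  # hypothetical class name
    def remove_checkpoint(self, path):
        # Only the global rank-0 worker removes the checkpoint directory ...
        if xm.get_ordinal() == 0:
            fs = get_filesystem(path)
            if fs.exists(path):
                fs.rm(path, recursive=True)
        # ... while every other rank blocks here until the deletion has finished.
        xm.rendezvous("Deleting checkpoint")

The plugin can then be passed to the Trainer, e.g. Trainer(plugins=[RankZeroRemoveCheckpointIO()], ...).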

Feel free to submit a pull request if you believe this resolves the issue.
