Fix the bug that the last checkpoint may not be saved properly in some cases #122

0x4f5da2 · 2020-11-29T17:48:11Z

If a non-master process exits before the master process enters save_checkpoint, the training job cannot end properly. Adding torch.distributed.barrier() fixes this.
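A minimal sketch of the idea, not the actual diff in this PR: rank 0 writes the checkpoint, and every rank then waits at a barrier so that no worker exits while the master is still saving. The helper names `save_last_checkpoint` and `_dist_ready`, and placing the barrier right after the write, are assumptions made for illustration. Only `torch.save` and the `torch.distributed` calls are real APIs.

```python
import torch
import torch.distributed as dist


def _dist_ready() -> bool:
    # True only when a process group has been initialized
    # (hypothetical helper, not part of this repository).
    return dist.is_available() and dist.is_initialized()


def save_last_checkpoint(state: dict, path: str) -> None:
    """Hypothetical helper: rank 0 writes, then all ranks synchronize."""
    rank = dist.get_rank() if _dist_ready() else 0

    if rank == 0:
        # Only the master process writes the file.
        torch.save(state, path)

    if _dist_ready():
        # Every rank blocks here until all ranks, including the master,
        # have reached this point, so no worker exits early.
        dist.barrier()
```

In a single-process run the barrier is skipped and the function simply saves the checkpoint. Under a multi-process launch, every rank has to call the function so that the barrier can complete.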

facebook-github-bot

@rajprateek has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

fix last ckpt failure

0691813

facebook-github-bot added the CLA Signed label Nov 29, 2020

facebook-github-bot reviewed Dec 2, 2020
