This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

Fix the bug that the last checkpoint may not be saved properly in some cases #122

Open
wants to merge 1 commit into
base: main
from

Conversation

0x4f5da2

If a non-master process exits before the master process enters save_checkpoint, the training process cannot end properly. Adding a torch.distributed.barrier() call fixes this: no process can exit until every process has reached the barrier.
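A minimal sketch of the idea, not the actual diff from this PR: the function name `save_final_checkpoint`, its parameters, and the exact placement of the barrier are assumptions made for illustration. Only the master rank writes the file, and every rank then waits at `torch.distributed.barrier()` so that none of them can exit while the master is still saving.

```python
import torch
import torch.distributed as dist


def save_final_checkpoint(state, path, is_master):
    """Hypothetical helper: save on the master rank, then sync all ranks."""
    if is_master:
        # Only the master process writes the checkpoint file.
        torch.save(state, path)
    # Every rank waits here until all ranks (including the master, once it
    # has finished saving) have arrived. The guard lets the same code run
    # in a single, non-distributed process, where no barrier is needed.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

In a distributed run every rank has to call this function, since `barrier()` is a collective operation; if any rank skips it, the remaining ranks block at the barrier.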

@facebook-github-bot added the CLA Signed label on Nov 29, 2020

@facebook-github-bot facebook-github-bot left a comment

@rajprateek has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

2 participants