Patches for remote sync waiting and log_local checkpointing #290
Three main changes (originally motivated by making it easier to chain together consecutive calls to `open_lm.main`):

1. Allow the user to specify a custom NCCL timeout. The default is 10 minutes, which may not be enough to run the final remote sync in all scenarios (the sync causes a timeout because all other processes are waiting at a barrier).
2. Remove unnecessary waits during the initial "test call" of remote sync and during the final sync by hard-coding 0 instead of `args.sync_every` as the argument to `remote_sync_with_expon_backoff`. These are the only instances in `main` where `remote_sync_with_expon_backoff` is called directly (as opposed to in a subprocess), so they cause the whole script to hang when `sync_every` is not 0.
3. `args.log_local` perhaps does not work as originally intended, since many logging calls are still gated behind `if is_master(args)` instead of `if is_master(args, local=args.log_local)`.
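
For reviewers, here is a rough, self-contained sketch of the three behaviours described above. It is not the code in this PR. `init_kwargs`, `sync_with_backoff`, and the seconds-based timeout argument are hypothetical names, the doubling wait schedule is an assumption about how the backoff works, and the body of `is_master` is only inferred from the `is_master(args, local=args.log_local)` call quoted above.

```python
import datetime
import time
from types import SimpleNamespace


def init_kwargs(backend, timeout_seconds=None):
    """Change 1 (sketch): build kwargs for torch.distributed.init_process_group.

    `timeout_seconds` stands in for a hypothetical user-facing option. When it is
    None, no timeout is passed and the library default applies.
    """
    kwargs = {"backend": backend}
    if timeout_seconds is not None:
        kwargs["timeout"] = datetime.timedelta(seconds=timeout_seconds)
    return kwargs


def sync_with_backoff(sync_every, sync_fn, max_retries=6, sleep=time.sleep):
    """Change 2 (sketch): retry `sync_fn`, sleeping before every attempt.

    Assumed schedule: sync_every * 2**attempt seconds. With a non-zero
    `sync_every`, even a sync that succeeds immediately blocks the caller first;
    passing 0 removes every wait.
    """
    for attempt in range(max_retries):
        sleep(sync_every * 2**attempt)
        if sync_fn():
            return True
    return False


def is_master(args, local=False):
    """Change 3 (sketch): global rank 0 by default, local rank 0 if `local`."""
    return args.local_rank == 0 if local else args.rank == 0


if __name__ == "__main__":
    # Record the requested waits instead of actually sleeping.
    waits = []
    sync_with_backoff(300, lambda: True, sleep=waits.append)
    print("waits with sync_every=300:", waits)

    waits.clear()
    sync_with_backoff(0, lambda: True, sleep=waits.append)
    print("waits with sync_every=0:", waits)

    # First process on a second node: local rank 0, but not global rank 0.
    args = SimpleNamespace(rank=8, local_rank=0, log_local=True)
    print("is_master(args):", is_master(args))
    print("is_master(args, local=args.log_local):", is_master(args, local=args.log_local))
```

In this sketch, a direct call with `sync_every=300` blocks for 300 seconds before the first sync attempt, which is the wait that passing 0 avoids. The last two lines show why a logging call gated on plain `is_master(args)` is skipped on a non-zero node even when `log_local` is set.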