Patches for remote sync waiting and log_local checkpointing #290

jeffreywpli · 2024-07-04T19:30:23Z

Three main changes, originally motivated by making it easier to chain together consecutive calls to open_lm.main:

Allow the user to specify a custom NCCL timeout. The default is 10 minutes, which may not be enough to run the final remote sync in all scenarios (the timeout is triggered because all other processes are waiting at a barrier).
Remove unnecessary waits during the initial "test call" of remote sync and during the final sync by hard-coding 0 instead of passing args.sync_every into remote_sync_with_expon_backoff. These are the only places in main where remote_sync_with_expon_backoff is called directly (as opposed to in a subprocess), so they cause the whole script to hang when sync_every is not 0.
args.log_local may not work as originally intended, since many logging calls are still gated behind if is_master(args) instead of if is_master(args, local=args.log_local).
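
For reviewers, here is a rough, self-contained sketch of the three behaviors described above. It is not the code in this PR. build_timeout and sync_with_backoff are hypothetical names, and the backoff schedule (initial delay doubled on each retry) and the rank/local_rank fields used by is_master are assumptions made for illustration. The one thing taken from the actual PyTorch API is that torch.distributed.init_process_group accepts a datetime.timedelta for its timeout argument.

```python
import datetime
import time


def build_timeout(minutes):
    """Turn a user-supplied number of minutes into a datetime.timedelta,
    the type that torch.distributed.init_process_group expects for `timeout`."""
    return datetime.timedelta(minutes=minutes)


def sync_with_backoff(sync_fn, initial_delay, max_retries=6, sleep=time.sleep):
    """Hypothetical stand-in for remote_sync_with_expon_backoff.

    Assumed schedule: sleep initial_delay * 2**attempt before each attempt.
    Passing initial_delay=0 therefore removes every wait, which is the
    effect wanted for the initial test sync and the final sync.
    """
    for attempt in range(max_retries):
        sleep(initial_delay * 2**attempt)
        if sync_fn():
            return True
    return False


def is_master(args, local=False):
    """Assumed semantics: rank 0 globally, or local rank 0 on each node
    when local=True."""
    return args.local_rank == 0 if local else args.rank == 0
```

Under these assumptions, a process with rank 8 and local_rank 0 is skipped by is_master(args) but passes is_master(args, local=True), which is why logging calls gated on the former never run on non-zero nodes even when args.log_local is set.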


Jeffrey added 10 commits July 2, 2024 00:58

add param

504050b

add timeout to both

4368279

fix type to datetime.timedelta

14d7210

fix typo

22e9968

fix typo in 2nd location

7ef1afe

do not sleep before first attempt at remote sync

d0f05b1

patch args.log_local

933a8ad

change rank0_only behavior when log_local

39e8aa8

fix sleep behavior only for last iteration, also address the parent d…

d8ca03c

…ir issue

0 sleep for initial sync

a857a41

jeffreywpli requested review from GeorgiosSmyrnis and achalddave July 4, 2024 19:30

linting

6f8fbab
