local_sgd: initial version of fault tolerant LocalSGD #47
Conversation
if self._local_step >= self._sync_every:
    self._local_step = 0
    self._average()
In the function description it says we allreduce the weights, but the function implemented below allreduces the grads?
allreduce_grads is a bit misnamed; I'll rename it since we're not actually reducing gradients here.
The behavior is just an allreduce with some rescaling logic on top.
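Concretely, it's along these lines (just a sketch using plain torch.distributed; in torchft this goes through the Manager so the participant count can shrink or grow, so treat the names here as illustrative):

```python
import torch
import torch.distributed as dist

def average_parameters(model: torch.nn.Module) -> None:
    """Average the model weights (not the gradients) across all workers."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        # Allreduce sums the weights from every worker...
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        # ...and the rescaling turns that sum into an average.
        p.data /= world_size
```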
done
torchft/local_sgd.py (Outdated)
def step(self) -> None:
    """
    This should be called after the optimizer step.
Hmm, instead of having the user call step after the optimizer step, I wonder if we could modify the init of LocalSGD to take the optimizer as well, and then add an optimizer post-step hook (https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.register_step_post_hook.html#torch.optim.Optimizer.register_step_post_hook) to automatically call this logic?
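Something like this, as a rough sketch (constructor arguments and method names here are illustrative, not the PR's actual API):

```python
from torch import nn, optim

class LocalSGD:
    def __init__(self, model: nn.Module, optimizer: optim.Optimizer, sync_every: int) -> None:
        self._model = model
        self._sync_every = sync_every
        self._local_step = 0
        # The post-step hook fires after every optimizer.step(), so the user
        # no longer has to call LocalSGD.step() manually.
        optimizer.register_step_post_hook(self._step_post_hook)

    def _step_post_hook(self, optimizer: optim.Optimizer, args: tuple, kwargs: dict) -> None:
        self._local_step += 1
        if self._local_step >= self._sync_every:
            self._local_step = 0
            self._average()

    def _average(self) -> None:
        # Placeholder for the allreduce/averaging logic from the thread above.
        pass
```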
TIL -- that looks pretty useful
I do think we could potentially move this into the forward wrapper as well, but it makes things a bit harder to reason about.
done
Force-pushed from ae46ea7 to 4bd104e
Force-pushed from 4bd104e to f132f7a
Very cool to add localSGD!
This implements an initial version of LocalSGD. #39

This has some limitations:
- … every `sync_every` steps, as it's treated as a single Manager step
- … every `sync_every` steps
- keeps a copy of the weights every `sync_every` steps on each worker. This copy will be required for DiLoCo to compute the pseudo gradient.

Test plan:
This will require more testing as I've only tested it with a mocked Manager and no e2e/integration tests
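For anyone skimming the thread, a rough usage sketch under the assumptions above (the Manager/LocalSGD lines are commented out because their exact signatures live in torchft/local_sgd.py in this PR; the argument names here are guesses):

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# manager = Manager(...)  # torchft Manager that handles quorum and worker failures
# local_sgd = LocalSGD(manager, model, optimizer, sync_every=32)

for _ in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).sum()
    loss.backward()
    optimizer.step()
    # With the post-step hook discussed above, the weights are averaged across
    # workers automatically every sync_every steps; without it, LocalSGD.step()
    # would be called here.
```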