Do ZeRO2 and ZeRO3 need gradient accumulation? #4118
Gy-Lu
started this conversation in
Development | Core
As we know, ZeRO2 and ZeRO3 partition the gradients across ranks, which is usually considered incompatible with gradient accumulation.
However, the two are not entirely incompatible.
For instance, each rank can accumulate the gradient shard it owns after the communication step of every micro-batch.
Drawback
This version of gradient accumulation saves no communication, since the gradients are still communicated for every micro-batch.
Advantage
For users who need a large batch size to train their model (e.g., for convergence) but have limited GPU memory, gradient accumulation can be a solution.
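To make the idea concrete, here is a minimal sketch in plain Python that simulates it without any real distributed backend. It assumes the communication step is a reduce-scatter that averages gradients across ranks; the names `reduce_scatter_mean` and `accumulate_sharded` are hypothetical helpers made up for illustration and are not part of any library.

```python
# Toy simulation only: ranks are list entries, not real processes.

def reduce_scatter_mean(per_rank_grads):
    """Average full gradients across ranks, then give each rank one shard."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    shard = n // world_size  # assumption: n is divisible by world_size
    mean = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    return [mean[r * shard:(r + 1) * shard] for r in range(world_size)]


def accumulate_sharded(micro_batches):
    """micro_batches[step][rank] is that rank's full local gradient."""
    steps = len(micro_batches)
    world_size = len(micro_batches[0])
    shard = len(micro_batches[0][0]) // world_size
    # Each rank keeps only a shard-sized accumulation buffer.
    buffers = [[0.0] * shard for _ in range(world_size)]
    comm_calls = 0
    for per_rank_grads in micro_batches:
        shards = reduce_scatter_mean(per_rank_grads)  # once per micro-batch
        comm_calls += 1
        for r in range(world_size):
            for i in range(shard):
                buffers[r][i] += shards[r][i] / steps
    return buffers, comm_calls


micro_batches = [
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]],   # step 0: rank 0, rank 1
    [[5.0, 6.0, 7.0, 8.0], [7.0, 8.0, 9.0, 10.0]],  # step 1: rank 0, rank 1
]
buffers, comm_calls = accumulate_sharded(micro_batches)
print(buffers)     # [[4.0, 5.0], [6.0, 7.0]]
print(comm_calls)  # 2
```

The accumulated shards match the averaged gradient of the full batch, and each rank only stores a buffer the size of its own shard. The number of communication calls equals the number of micro-batches, which is the drawback noted above.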