Support MoE for pipeline models #5338
Commits on Apr 4, 2024
- MOE: Support bf16 grads reduce for pipeline (0050fda)
  Signed-off-by: Moshe Island <[email protected]>
- MOE: Use backward compatible methods to access tp info (d04cb9c)
  Currently, MoE uses Megatron-DeepSpeed APIs to get tensor-parallel info (rank, world size, group). To enable MoE for PipelineModule, switch to backward-compatible methods that can access the Megatron, DeepSpeed Topology, or old Megatron APIs. Since MoE is not part of the DeepSpeed runtime, move the backward-compatible methods to deepspeed.utils and update imports as required.
  Signed-off-by: Moshe Island <[email protected]>
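  For illustration, a minimal sketch of what such a backward-compatible accessor can look like; the function name, fallback order, and attribute names are assumptions here, not necessarily what the commit adds to deepspeed.utils:

  ```python
  def bwc_tensor_model_parallel_rank(mpu=None):
      """Hypothetical sketch of a backward-compatible TP-rank lookup.

      Tries the current Megatron API name first, then falls back to the
      old Megatron name, so the same call works whether `mpu` comes from
      Megatron, a DeepSpeed Topology wrapper, or an older fork.
      """
      if mpu is None:
          # No model parallelism configured: behave as TP rank 0.
          return 0
      if hasattr(mpu, 'get_tensor_model_parallel_rank'):
          # Current Megatron API.
          return mpu.get_tensor_model_parallel_rank()
      # Old Megatron API (assumed fallback).
      return mpu.get_model_parallel_rank()
  ```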
- MOE: Enable save MoE checkpoint for Pipeline models (f5c4d1a)
  Signed-off-by: Moshe Island <[email protected]>
- MOE: Support display of MoE loss for Pipeline models (a0e8012)
  Currently, only "total_loss" is displayed. If the model has additional losses (e.g. MoE loss), display them as well. As with "total_loss", additional losses are displayed for the full batch after being mean-reduced across DP ranks.
  Signed-off-by: Moshe Island <[email protected]>
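  As a rough illustration of the reduction described above (a sketch only; the helper name and dict-of-losses shape are assumptions, not the engine's actual interface):

  ```python
  import torch.distributed as dist

  def reduce_additional_losses(losses, dp_group):
      """Hypothetical sketch: mean-reduce each extra loss (e.g. the MoE
      aux loss) across data-parallel ranks, mirroring how total_loss is
      reduced before display."""
      world_size = dist.get_world_size(group=dp_group)
      reduced = {}
      for name, value in losses.items():
          t = value.detach().clone()
          dist.all_reduce(t, op=dist.ReduceOp.SUM, group=dp_group)
          reduced[name] = t / world_size
      return reduced
  ```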
- MOE: Fix loading checkpoint of Pipeline models (b20db80)
  Signed-off-by: Moshe Island <[email protected]>
- MOE: Fix group for max capacity all-reduce (a46f35d)
  Currently, when using no-drop tokens, we compute the capacity locally and then all-reduce (op=MAX) over the world group. This fails with pipeline parallelism (with micro-batches), since workers on different stages handle different model layers (or, during warmup, first-stage workers are processing while last-stage workers are idle). Fix it by running the all-reduce over the expert group.
  Signed-off-by: Moshe Island <[email protected]>
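  A minimal sketch of the fix as described, assuming a `capacity` tensor and an already-created expert-parallel process group (names are hypothetical):

  ```python
  import torch.distributed as dist

  def sync_capacity(capacity, expert_group):
      """Hypothetical sketch: with no-drop tokens, each rank computes
      its local capacity, then ranks agree on the max. Reducing over
      the expert-parallel group (not the world group) avoids hanging
      under pipeline parallelism, where ranks on other stages never
      reach this layer's all-reduce."""
      dist.all_reduce(capacity, op=dist.ReduceOp.MAX, group=expert_group)
      return capacity
  ```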
- MOE: Enhance expert group creation for pipeline (d8ecc22)
  This commit enhances expert group creation for both modes (a sketch of the constraint follows this entry):
  - DP + PP + EP
  - DP + TP + PP + EP
  Signed-off-by: Moshe Island <[email protected]>
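  To make the constraint concrete, here is a hypothetical sketch of expert-group creation for the DP + PP + EP case; the stage-major rank layout is an assumption for illustration, not the commit's actual topology code:

  ```python
  import torch.distributed as dist

  def create_expert_groups(world_size, pp_size, ep_size):
      """Hypothetical sketch for DP + PP + EP: every expert-parallel
      group must stay inside a single pipeline stage, since only ranks
      on the same stage hold the same MoE layers. Assumes a stage-major
      rank layout."""
      dp_size = world_size // pp_size
      groups = []
      for stage in range(pp_size):
          stage_ranks = list(range(stage * dp_size, (stage + 1) * dp_size))
          # Carve each stage's data-parallel ranks into EP groups.
          for start in range(0, dp_size, ep_size):
              ranks = stage_ranks[start:start + ep_size]
              # new_group must be called identically on all ranks.
              groups.append(dist.new_group(ranks=ranks))
      return groups
  ```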
- MOE: Update global norm calculation for pipeline (0f9d2b5)
  When using MoE with MoE-TP disabled, use the pipeline-parallel group to max- or sum-reduce MoE gradients. This also fixes the behavior for the following configuration: no pipeline, TP enabled, MoE-TP disabled.
  Signed-off-by: Moshe Island <[email protected]>
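  A sketch of the idea, assuming a list of local MoE gradients and a pipeline-parallel group (the helper name and device handling are simplified assumptions):

  ```python
  import torch
  import torch.distributed as dist

  def moe_grad_global_norm(moe_grads, pp_group):
      """Hypothetical sketch: with MoE-TP disabled, different pipeline
      stages hold different expert parameters, so the squared-norm sum
      must be reduced across the pipeline-parallel group before taking
      the square root. Device placement is elided for brevity."""
      total_sq = torch.zeros(1)
      for g in moe_grads:
          total_sq += g.detach().float().norm() ** 2
      dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=pp_group)
      return total_sq.sqrt().item()
  ```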
Commits on Apr 5, 2024
- MOE: fix style issue in pipe load_module_state_dict (b6067d7)
  Signed-off-by: Moshe Island <[email protected]>
- 526ce7f
Commits on Apr 7, 2024
- 4d8bf27