Support MoE for pipeline models #5338
Conversation
Please note that Megatron-DeepSpeed PR#373 (https://github.com/microsoft/Megatron-DeepSpeed/pull/373) depends on this PR.
Signed-off-by: Moshe Island <[email protected]>
Currently, MoE uses Megatron-DeepSpeed APIs to get tensor-parallel info (rank, world_size, group). To enable MoE for PipelineModule, switch to backward-compatible methods that can access the current Megatron APIs, the DeepSpeed Topology, or the old Megatron APIs. Since MoE is not part of the DeepSpeed runtime, move the backward-compatible methods to deepspeed.utils and update imports as required. Signed-off-by: Moshe Island <[email protected]>
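For reference, the backward-compatibility pattern described above can be sketched as follows (a minimal illustration of the approach, not the exact helpers added by this PR): probe the mpu object for the newer Megatron accessor name and fall back to the older one.

```python
def bwc_tensor_model_parallel_rank(mpu=None):
    """Illustrative backward-compatible accessor for the tensor-parallel rank:
    prefer the newer Megatron / DeepSpeed Topology naming, fall back to the
    old Megatron naming, and default to rank 0 when no mpu is configured."""
    if mpu is None:
        # No model-parallel unit: behave as a single tensor-parallel rank.
        return 0
    if hasattr(mpu, 'get_tensor_model_parallel_rank'):
        # Current Megatron / DeepSpeed Topology API.
        return mpu.get_tensor_model_parallel_rank()
    # Old Megatron API.
    return mpu.get_model_parallel_rank()
```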
Signed-off-by: Moshe Island <[email protected]>
Currently, only "total_loss" is displayed. If the model has additional losses (e.g. MoE loss), display them as well. Like "total_loss", additional losses are displayed for the full batch after being mean-reduced across DP ranks. Signed-off-by: Moshe Island <[email protected]>
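As a rough sketch of the loss reporting described above (names such as `additional_losses` and `dp_group` are illustrative placeholders, not the PR's code), each extra loss is summed across the data-parallel group and divided by the DP world size before logging:

```python
import torch.distributed as dist


def reduce_additional_losses(additional_losses, dp_group, dp_world_size):
    """Mean-reduce each named loss across data-parallel ranks so the logged
    value reflects the full batch, mirroring how 'total_loss' is reported."""
    reduced = {}
    for name, loss in additional_losses.items():
        value = loss.clone().detach()
        dist.all_reduce(value, op=dist.ReduceOp.SUM, group=dp_group)
        reduced[name] = value / dp_world_size
    return reduced
```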
Signed-off-by: Moshe Island <[email protected]>
Currently, when using no-drop tokens, we compute the capacity locally and then all-reduce(op=max) over the world group. This fails when using pipeline parallelism (with micro-batches), since workers on different stages handle different model layers (or, during warmup, first-stage workers are processing while last-stage workers are idle). Fix it by running the all-reduce over the expert-parallel group. Signed-off-by: Moshe Island <[email protected]>
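In code, the fix boils down to changing the process group passed to the max-reduction; a minimal sketch, assuming `capacity` is a tensor holding the locally computed capacity and `expert_parallel_group` is the MoE layer's expert-parallel process group:

```python
import torch
import torch.distributed as dist


def sync_capacity(capacity: torch.Tensor, expert_parallel_group) -> torch.Tensor:
    """Take the maximum locally computed capacity across the expert-parallel
    group (instead of the world group), so pipeline-stage workers that do not
    hold this MoE layer never need to join the collective."""
    dist.all_reduce(capacity, op=dist.ReduceOp.MAX, group=expert_parallel_group)
    return capacity
```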
This commit enhances expert group creation for both modes:
- DP + PP + EP
- DP + TP + PP + EP
Signed-off-by: Moshe Island <[email protected]>
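The partitioning idea behind expert group creation can be sketched as follows (a simplified illustration under the usual assumption that expert parallelism is carved out of the data-parallel ranks of each TP x PP slice; this is not the PR's actual group-creation code):

```python
def build_expert_group_ranks(dp_ranks_per_slice, expert_parallel_size):
    """For each (tensor-parallel, pipeline-stage) slice, split its
    data-parallel ranks into contiguous chunks of expert_parallel_size;
    each chunk is the rank list for one expert-parallel group.

    dp_ranks_per_slice: list of rank lists, one per TP x PP coordinate.
    """
    expert_groups = []
    for dp_ranks in dp_ranks_per_slice:
        for start in range(0, len(dp_ranks), expert_parallel_size):
            expert_groups.append(dp_ranks[start:start + expert_parallel_size])
    return expert_groups


# Example: one slice with 8 DP ranks and expert parallel size 4
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]
print(build_expert_group_ranks([[0, 1, 2, 3, 4, 5, 6, 7]], 4))
```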
When using MoE with MoE-TP disabled, use the pipeline-parallel group to max or sum MoE gradients. This also fixes the behavior for the following configuration: no pipeline, TP enabled, MoE-TP disabled. Signed-off-by: Moshe Island <[email protected]>
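A hedged sketch of the group selection this implies (the helper and parameter names are hypothetical, not the PR's code): with MoE tensor parallelism disabled, MoE gradients are reduced over the pipeline-parallel group; otherwise the previously used group is kept.

```python
import torch.distributed as dist


def reduce_moe_grads(moe_grads, moe_tp_enabled, pp_group, default_group):
    """When MoE-TP is disabled, reduce MoE gradients over the
    pipeline-parallel group; otherwise fall back to the previously
    used group (hypothetical helper for illustration)."""
    group = default_group if moe_tp_enabled else pp_group
    for grad in moe_grads:
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
```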
This is incredibly great work. Thank you for the amazing contribution!
I left a comment about a very small part but have already approved this PR.
Signed-off-by: Moshe Island <[email protected]>
Done
This PR enhances DeepSpeed to support MoE for pipeline models (e.g. GPTModelPipe from Megatron-DeepSpeed).

Main changes:
- Enhance expert group creation for pipeline (both flavors: DP/PP/EP and DP/TP/PP/EP).
- Fix MoE save/load checkpoint for PipelineModule-based models.
- Display MoE loss for PipelineModule-based models.
- Support gradients reduce for BF16_Optimizer for PipelineModule. Note that the same commit also fixes a gradients reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer for a dense (no MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) using the expert-parallel group instead of the world group.

Signed-off-by: Moshe Island <[email protected]>
Co-authored-by: Moshe Island <[email protected]>