Support MoE for pipeline models #5338

Merged 11 commits on Apr 8, 2024

Commits on Apr 4, 2024

  1. MOE: Support bf16 grads reduce for pipeline

    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: 0050fda
  2. MOE: Use backward compatible methods to access tp info

    Currently, MoE uses Megatron-DeepSpeed APIs to get tensor-parallel info (rank,
    world_size, group).
    
    To enable MoE for PipelineModule, modify MoE to use backward-compatible
    methods that can access either the Megatron, DeepSpeed Topology, or old
    Megatron APIs.
    
    Since MoE is not part of the DeepSpeed runtime, move the backward-compatible
    methods to deepspeed.utils and modify imports as required.
    
    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: d04cb9c
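
    A minimal sketch of the backward-compatible accessor pattern described
    above, assuming an optional `mpu` object; the exact helper names and
    fallbacks in deepspeed.utils may differ:

    ```python
    def bwc_tensor_model_parallel_rank(mpu=None):
        """Best-effort tensor-parallel rank across Megatron generations."""
        if mpu is None:
            # No model-parallel unit given: act as a single TP rank.
            return 0
        if hasattr(mpu, 'get_tensor_model_parallel_rank'):
            # Current Megatron / DeepSpeed Topology style API.
            return mpu.get_tensor_model_parallel_rank()
        # Old Megatron API, where "model parallel" meant tensor parallel.
        return mpu.get_model_parallel_rank()
    ```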
  3. MOE: Enable save MoE checkpoint for Pipeline models

    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: f5c4d1a
  4. MOE: Support display of MoE loss for Pipeline models

    Currently, only "total_loss" is displayed.
    If the model has additional losses (e.g. the MoE loss), display them as well.
    As with "total_loss", additional losses are displayed for the full batch
    after being mean-reduced across DP ranks.
    
    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: a0e8012
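
    A hedged sketch of the display-time reduction, assuming `dp_group` is the
    data-parallel process group (names here are illustrative, not the exact
    DeepSpeed internals):

    ```python
    import torch
    import torch.distributed as dist

    def reduce_loss_for_display(loss: torch.Tensor, dp_group) -> torch.Tensor:
        # Mean-reduce a scalar loss (e.g. the MoE aux loss) across DP ranks,
        # mirroring how "total_loss" is reduced before it is displayed.
        reduced = loss.detach().clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=dp_group)
        reduced /= dist.get_world_size(group=dp_group)
        return reduced
    ```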
  5. MOE: Fix loading checkpoint of Pipeline models

    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: b20db80
  6. MOE: Fix group for max capacity all-reduce

    Currently, when using no-drop tokens, we calculate the capacity locally and
    then all-reduce (op=max) over the world group.
    
    This fails when using pipeline parallelism (with micro-batches), since
    different stage workers handle different model layers (or, during warmup,
    first-stage workers are processing while last-stage workers are idle).
    
    Fix this by running the all-reduce over the expert group.
    
    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: a46f35d
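
    An illustrative sketch of the fix, assuming `expert_group` is the
    expert-parallel process group (the helper name is hypothetical):

    ```python
    import torch
    import torch.distributed as dist

    def sync_capacity(local_capacity: int, expert_group) -> int:
        capacity = torch.tensor(local_capacity, dtype=torch.long,
                                device=torch.cuda.current_device())
        # Before the fix this used group=None (the world group), which hangs
        # under pipeline parallelism since other stages never join the call.
        dist.all_reduce(capacity, op=dist.ReduceOp.MAX, group=expert_group)
        return int(capacity.item())
    ```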
  7. MOE: Enhance expert group creation for pipeline

    This commit enhances expert group creation for both modes:
    - DP + PP + EP
    - DP + TP + PP + EP
    
    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: d8ecc22
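
    A simplified sketch of per-stage expert group creation for the
    DP + PP + EP case (no TP). The contiguous rank layout here is an
    assumption; the actual code derives rank lists from the DeepSpeed
    process topology:

    ```python
    import torch.distributed as dist

    def create_expert_groups(world_size, pp_size, ep_size):
        dp_size = world_size // pp_size
        groups = {}
        for stage in range(pp_size):
            # Assume the DP ranks of a pipeline stage are contiguous.
            stage_ranks = [stage * dp_size + r for r in range(dp_size)]
            # Partition each stage's DP ranks into expert-parallel groups.
            for i in range(0, dp_size, ep_size):
                ranks = stage_ranks[i:i + ep_size]
                # new_group must be called by all ranks for every group.
                groups[tuple(ranks)] = dist.new_group(ranks=ranks)
        return groups
    ```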
  8. MOE: Update global norm calculation for pipeline

    When using MoE with MoE-TP disabled, use the pipeline parallel group to max
    or sum the MoE gradients.
    
    This also fixes the behavior for the following configuration:
    no pipeline, TP enabled, MoE-TP disabled.
    
    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 4, 2024
    Commit: 0f9d2b5
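
    A hedged sketch of the L2 case: with MoE-TP disabled, each pipeline stage
    holds a distinct slice of the expert parameters, so the squared norms are
    summed over the pipeline-parallel group (`pp_group` is illustrative; an
    inf-norm variant would use ReduceOp.MAX instead):

    ```python
    import torch
    import torch.distributed as dist

    def global_moe_grad_norm(local_norm_sq: torch.Tensor,
                             pp_group) -> torch.Tensor:
        # Sum squared per-stage MoE gradient norms across stages, then sqrt.
        total = local_norm_sq.clone()
        dist.all_reduce(total, op=dist.ReduceOp.SUM, group=pp_group)
        return total.sqrt()
    ```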

Commits on Apr 5, 2024

  1. MOE: fix style issue in pipe load_module_state_dict

    Signed-off-by: Moshe Island <[email protected]>
    misland-habana committed Apr 5, 2024
    Commit: b6067d7
  2. Commit: 526ce7f

Commits on Apr 7, 2024

  1. Commit: 4d8bf27