
add zero3 module_granularity_threshold to zero optimization. #6649

Merged (36 commits) Nov 12, 2024

Conversation

@inkcherry (Contributor) commented Oct 21, 2024

This PR adds ZeRO-3 (Z3) coalesced fetch to zero optimization. Some existing logic can be reused, but it is hard to recognize it as an optimization choice (I only discovered this logic while trying to implement the feature).

The benefit of this approach is reduced host overhead (far fewer hooks) during the recursive fetching of parameters, especially in fine-grained models such as those with a large number of MoE experts. This is particularly helpful on host-sensitive devices (such as HPU), where it achieved a 40% performance improvement in our customer workloads.
FYI @delock @deepcharm

@delock (Collaborator) commented Oct 23, 2024

@inkcherry there are some errors in the CI workflows; are they related to your change?

@loadams (Contributor) commented Oct 25, 2024

> @inkcherry there are some errors in the CI workflows; are they related to your change?

@delock - there were issues with the nv-accelerate and nv-torch workflows, but both of those should be resolved now.

@inkcherry (Contributor, Author) commented

thanks @loadams @delock! After adding some changes, I see a FileNotFoundError in the current CI; it seems unrelated to this patch.

@nelyahu (Contributor) commented Oct 30, 2024

@inkcherry this PR looks very promising. on which model did you benchmark the performance?

@inkcherry (Contributor, Author) commented Oct 31, 2024

> @inkcherry this PR looks very promising. on which model did you benchmark the performance?

@nelyahu The model I'm testing has 64 experts per MoE layer, with each expert containing 3 linear layers. Including the non-expert parameters, each MoE layer consists of 197 parameters (all weights, no biases). There are 48 layers in total; I think it is similar in style to the open-source Qwen2-MoE model. Introducing a hook for each of these modules therefore incurs very high overhead; a toy sketch of the structure follows.
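
To make the hook count concrete, here is a hypothetical sketch of one such layer (all names and sizes invented; only the structure mirrors the description above):

```python
import torch.nn as nn

# Hypothetical stand-in for the layer described above: 64 experts, each with
# 3 bias-free linear layers, plus a router. Sizes are made up; only the
# parameter count matters. The real model reaches 197 parameters per layer
# once the remaining non-expert weights are included.
class ToyMoELayer(nn.Module):
    def __init__(self, hidden=64, num_experts=64):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, hidden, bias=False),
                nn.Linear(hidden, hidden, bias=False),
                nn.Linear(hidden, hidden, bias=False),
            )
            for _ in range(num_experts)
        )

layer = ToyMoELayer()
print(sum(1 for _ in layer.parameters()))  # 193: 64 experts * 3 weights + router
```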

@tjruwase removed the request for review from awan-10 October 31, 2024 14:47
@tjruwase (Contributor) commented

@inkcherry, thanks for this PR. Can you clarify the difference between coalesced params and the leaf modules? I notice that this implementation relies on the leaf modules code.

@inkcherry (Contributor, Author) commented Nov 1, 2024

Thanks for the review @tjruwase.
In this patch, modules that only need to be hooked once are marked as z3_leaf_module during the init stage, since the z3_leaf_module logic already meets the requirements: it avoids recursively adding hooks and fetches all of a module's parameters at once.

I found that this also helps on GPU in such cases (although not as clearly as on HPU). I think it is suitable to add it to the comm optimization config under a new name, because personally I think z3_leaf_module is better suited as an attribute or API name, and the reduced-hook-overhead scenario is only one use case of z3_leaf_module (another case seems aimed at fixing the issue where prefetch cannot predict accurately, because the parameters used in the model's forward pass may differ from those in the trace). Adding an independent switch might also facilitate conditional operations in the future.
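
For context, this is roughly how the existing leaf-module mechanism is driven from client code today (a minimal sketch; MyMoEBlock is a stand-in for the model's real MoE block class, e.g. Qwen2MoeSparseMoeBlock):

```python
import torch.nn as nn
from deepspeed.utils import set_z3_leaf_modules

# Stand-in for the model's expert-containing block class.
class MyMoEBlock(nn.Module):
    def __init__(self, hidden=16, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

model = nn.Sequential(MyMoEBlock(), MyMoEBlock())

# Mark every MyMoEBlock instance as a ZeRO-3 "leaf": hooks stop at this
# module boundary and its parameters are fetched in one shot rather than
# recursively, one sub-module at a time.
leaves = set_z3_leaf_modules(model, [MyMoEBlock])
print(f"marked {len(leaves)} modules as ZeRO-3 leaf modules")
```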

@tjruwase (Contributor) commented Nov 1, 2024

@inkcherry, thanks for the explanation.

I agree that avoiding recursive hook_and_fetch can benefit both functionality (e.g., MoE) and performance (e.g., communication) scenarios. The part I am unsure about is whether these functionalities should be exposed through the DeepSpeed API or ds_config. Since users need to specify model-specific modules for this feature, I prefer the API approach: it is more natural for model-specific details to be expressed in the client code rather than in the ds_config. Currently, the ds_config is generally model-agnostic, which makes it convenient to reuse.

I will be glad to hear your thoughts. Also, can you please share some unit tests to demonstrate usage?

@inkcherry (Contributor, Author) commented Nov 4, 2024

@tjruwase, thank you for your suggestions. Yes, I agree with your concerns. Initially I used the config because I felt this API was difficult for users to discover (unless they encountered a related issue and searched the issue tracker), or they might recognize the API but be unable to judge its performance impact (compared with other fetch-related optimization options in the config, such as overlap_comm, bucket_size, etc.).

I discussed this with @delock and changed it to an integer representing the model's granularity, defined as (number of parameter elements) / (number of required hooks). This lets users set the value themselves; it should be reusable within the same software and hardware environment, and I think an optimal range can be derived for a given hardware/software setup. I have now updated the code with unit tests and look forward to your suggestions. A rough sketch of the metric is shown below.
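
In code, that metric reads roughly as follows (an illustrative sketch of the definition above, not the PR's exact bookkeeping):

```python
import torch.nn as nn

def module_granularity(module: nn.Module) -> float:
    """Granularity = (number of parameter elements) / (number of hooks needed)."""
    num_elements = sum(p.numel() for p in module.parameters())
    # Sub-modules that directly own parameters would each need a ZeRO-3 hook.
    num_hooks = sum(1 for m in module.modules()
                    if next(m.parameters(recurse=False), None) is not None)
    return num_elements / max(num_hooks, 1)
```

A fine-grained MoE block (many small weight tensors, hence many hooks) scores low on this metric, so a low score flags the modules worth coalescing.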

@inkcherry changed the title from "add zero3 coalesced parameters fetch to zero optimization." to "add zero3 module_granularity_threshold to zero optimization." Nov 6, 2024
@loadams added this pull request to the merge queue Nov 12, 2024
Merged via the queue into microsoft:master with commit 7af3a4b Nov 12, 2024
15 checks passed
@skyshine102 commented

Hi @inkcherry, I'm wondering how to set the module_granularity_threshold for

  • Qwen1.5-MoE-A2.7B (4 activated over 60 routing experts)
  • Deepseek-v2-lite (4 activated over 64 routing experts).

Can you provide a heuristic to set this value?

@inkcherry (Contributor, Author) commented Nov 14, 2024

Hi @skyshine102, when you enable this switch (set the value > 0, regardless of whether it ends up taking effect), it prints every module's granularity. In theory, the smallest values should appear in blocks like XXMoeSparseBlock (all expert params) or XXMoeDecoderLayer (all experts plus some norm params). You only need to set the threshold greater than or equal to the value printed for the block you want coalesced. If you find that the ZeRO-3 hook overhead affects performance, this may help; see the sketch below.
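
A minimal config sketch under that heuristic (the threshold value here is invented; read the right value off the granularities the run prints):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Per the heuristic above: modules whose printed granularity falls at
        # or below this value are fetched as a single unit. 0 (the default)
        # leaves the feature off.
        "module_granularity_threshold": 12000,
    },
}
```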
