help on understanding the implementation of FSDP. #5441

Open
jq-wei opened this issue Sep 15, 2024 · 0 comments
Labels
pending: This problem is yet to be addressed

Comments

jq-wei commented Sep 15, 2024

I don't understand the implementation of FSDP. I think that in order to use FSDP, the model needs to be wrapped with FullyShardedDataParallel (or FSDP) imported from torch.distributed.fsdp, but I cannot find these two keywords used anywhere in this repo.
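For reference, this is the kind of minimal FSDP wrapping I would expect to see somewhere in the code. It is only a sketch of my understanding, not code from this repo: the Linear model is a placeholder and it assumes a single-node torchrun launch.

```python
# Minimal FSDP wrapping as I understand it (a sketch, not code from this repo).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # rank == local rank on a single node

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
model = FSDP(model)  # parameters are sharded across ranks here

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
optimizer.step()
```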

So I don't think FSDP is implemented natively; is it instead, perhaps, some combination of ZeRO-2 and ZeRO-3 from DeepSpeed? Am I missing something?
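For contrast, my rough picture of the DeepSpeed ZeRO-3 path is the sketch below; the config values and the model are placeholders I made up, not what this repo actually uses.

```python
# Rough sketch of a DeepSpeed ZeRO-3 setup (placeholder config and model,
# not taken from this repo).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},  # ZeRO-3 shards params, grads, and optimizer states
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```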

In fact, some other issues are related to this question: basically, the model is not sharded, and each GPU holds one full copy of the model. See, for example, issues #5140 and #4912.

Thank you very much in advance for giving some hints.

github-actions bot added the pending label (This problem is yet to be addressed) on Sep 15, 2024